[PATCH] btrfs-progs: Add mount point output for 'btrfs fi df' command.

2014-07-07 Thread Qu Wenruo
Add mount point output for 'btrfs fi df'.
Also since the patch uses find_mount_root() to find mount point,
now 'btrfs fi df' can output more meaningful error message when given a
non-btrfs path.

Signed-off-by: Qu Wenruo 
---
This patch needs to be merged after the following patch:
btrfs-progs: Check fstype in find_mount_root()
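
(Illustrative output with the patch applied -- the path and numbers are
made up; only the leading "Mounted on:" line is new:

  # btrfs fi df /mnt/btrfs/some/subdir
  Mounted on: /mnt/btrfs
  Data, single: total=8.00GiB, used=6.42GiB
  ...)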
---
 cmds-filesystem.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 4b2d27e..d571765 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -187,12 +187,22 @@ static int cmd_filesystem_df(int argc, char **argv)
int ret;
int fd;
char *path;
+   char *mount_point = NULL;
DIR *dirstream = NULL;
 
if (check_argc_exact(argc, 2))
usage(cmd_filesystem_df_usage);
 
path = argv[1];
+   ret = find_mount_root(path, &mount_point);
+   if (ret < 0) {
+           if (ret != -ENOENT)
+                   fprintf(stderr,
+                           "ERROR: Failed to find mount root for path %s: %s\n",
+                           path, strerror(-ret));
+           return 1;
+   }
+   printf("Mounted on: %s\n", mount_point);
+   free(mount_point);
 
fd = open_file_or_dir(path, &dirstream);
if (fd < 0) {
-- 
2.0.1



Re: [PATCH RFC] btrfs: code optimize use btrfs_get_bdev_and_sb() at btrfs_scan_one_device

2014-07-07 Thread Miao Xie
On Tue, 8 Jul 2014 12:08:19 +0800, Liu Bo wrote:
> On Tue, Jul 08, 2014 at 02:38:37AM +0800, Anand Jain wrote:
>> (for review comments pls).
>>
>> btrfs_scan_one_device() needs SB, instead of doing it from scratch could
>> use btrfs_get_bdev_and_sb()
>>
>> Signed-off-by: Anand Jain 
>> ---
>>  fs/btrfs/volumes.c | 51 ++-
>>  1 file changed, 6 insertions(+), 45 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index c166355..94e6131 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1053,14 +1053,11 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>>  {
>>  struct btrfs_super_block *disk_super;
>>  struct block_device *bdev;
>> -struct page *page;
>> -void *p;
>>  int ret = -EINVAL;
>>  u64 devid;
>>  u64 transid;
>>  u64 total_devices;
>> -u64 bytenr;
>> -pgoff_t index;
>> +struct buffer_head *bh;
>>  
>>  /*
>>   * we would like to check all the supers, but that would make
>> @@ -1068,44 +1065,12 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>>   * So, we need to add a special mount option to scan for
>>   * later supers, using BTRFS_SUPER_MIRROR_MAX instead
>>   */
>> -bytenr = btrfs_sb_offset(0);
>>  mutex_lock(&uuid_mutex);
>>  
>> -bdev = blkdev_get_by_path(path, flags, holder);
>> -
>> -if (IS_ERR(bdev)) {
>> -ret = PTR_ERR(bdev);
>> +ret = btrfs_get_bdev_and_sb(path, flags, holder, 0, &bdev, &bh);
>> +if (ret)
>>  goto error;
>> -}
>> -
>> -/* make sure our super fits in the device */
>> -if (bytenr + PAGE_CACHE_SIZE >= i_size_read(bdev->bd_inode))
>> -goto error_bdev_put;
>> -
>> -/* make sure our super fits in the page */
>> -if (sizeof(*disk_super) > PAGE_CACHE_SIZE)
>> -goto error_bdev_put;
>> -
>> -/* make sure our super doesn't straddle pages on disk */
>> -index = bytenr >> PAGE_CACHE_SHIFT;
>> -if ((bytenr + sizeof(*disk_super) - 1) >> PAGE_CACHE_SHIFT != index)
>> -goto error_bdev_put;
> 
> Apparently btrfs_get_bdev_and_sb() lacks the above two checks, otherwise 
> looks good.

In fact, our disk_super size is constant and <= the minimum page size (4K),
and we are sure that it is impossible for the super block to cross a page
boundary, so the above two checks are unnecessary.
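
(A compile-time sketch of that invariant, assuming the mainline constants
BTRFS_SUPER_INFO_SIZE (4K) and BTRFS_SUPER_INFO_OFFSET (64K): the super
block region is at most one 4K block and sits at a 4K-aligned offset, so
it can never straddle one:

	/* sketch only, not part of the patch under review */
	BUILD_BUG_ON(sizeof(struct btrfs_super_block) > BTRFS_SUPER_INFO_SIZE);
	BUILD_BUG_ON(BTRFS_SUPER_INFO_SIZE > PAGE_CACHE_SIZE);
	BUILD_BUG_ON(BTRFS_SUPER_INFO_OFFSET % BTRFS_SUPER_INFO_SIZE);
)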

Thanks
Miao

> 
> thanks,
> -liubo
> 
>> -
>> -/* pull in the page with our super */
>> -page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
>> -   index, GFP_NOFS);
>> -
>> -if (IS_ERR_OR_NULL(page))
>> -goto error_bdev_put;
>> -
>> -p = kmap(page);
>> -
>> -/* align our pointer to the offset of the super block */
>> -disk_super = p + (bytenr & ~PAGE_CACHE_MASK);
>> -
>> -if (btrfs_super_bytenr(disk_super) != bytenr ||
>> -btrfs_super_magic(disk_super) != BTRFS_MAGIC)
>> -goto error_unmap;
>> +disk_super = (struct btrfs_super_block *) bh->b_data;
>>  
>>  devid = btrfs_stack_device_id(&disk_super->dev_item);
>>  transid = btrfs_super_generation(disk_super);
>> @@ -1125,13 +1090,9 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>>  printk(KERN_CONT "devid %llu transid %llu %s\n", devid, transid, path);
>>  }
>>  
>> -
>> -error_unmap:
>> -kunmap(page);
>> -page_cache_release(page);
>> -
>> -error_bdev_put:
>> +brelse(bh);
>>  blkdev_put(bdev, flags);
>> +
>>  error:
>>  mutex_unlock(&uuid_mutex);
>>  return ret;
>> -- 
>> 2.0.0.257.g75cc6c6
>>


Re: [PATCH RFC] btrfs: code optimize use btrfs_get_bdev_and_sb() at btrfs_scan_one_device

2014-07-07 Thread Liu Bo
On Tue, Jul 08, 2014 at 02:38:37AM +0800, Anand Jain wrote:
> (for review comments pls).
> 
> btrfs_scan_one_device() needs SB, instead of doing it from scratch could
> use btrfs_get_bdev_and_sb()
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/volumes.c | 51 ++-
>  1 file changed, 6 insertions(+), 45 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c166355..94e6131 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1053,14 +1053,11 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>  {
>   struct btrfs_super_block *disk_super;
>   struct block_device *bdev;
> - struct page *page;
> - void *p;
>   int ret = -EINVAL;
>   u64 devid;
>   u64 transid;
>   u64 total_devices;
> - u64 bytenr;
> - pgoff_t index;
> + struct buffer_head *bh;
>  
>   /*
>* we would like to check all the supers, but that would make
> @@ -1068,44 +1065,12 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>* So, we need to add a special mount option to scan for
>* later supers, using BTRFS_SUPER_MIRROR_MAX instead
>*/
> - bytenr = btrfs_sb_offset(0);
>   mutex_lock(&uuid_mutex);
>  
> - bdev = blkdev_get_by_path(path, flags, holder);
> -
> - if (IS_ERR(bdev)) {
> - ret = PTR_ERR(bdev);
> + ret = btrfs_get_bdev_and_sb(path, flags, holder, 0, &bdev, &bh);
> + if (ret)
>   goto error;
> - }
> -
> - /* make sure our super fits in the device */
> - if (bytenr + PAGE_CACHE_SIZE >= i_size_read(bdev->bd_inode))
> - goto error_bdev_put;
> -
> - /* make sure our super fits in the page */
> - if (sizeof(*disk_super) > PAGE_CACHE_SIZE)
> - goto error_bdev_put;
> -
> - /* make sure our super doesn't straddle pages on disk */
> - index = bytenr >> PAGE_CACHE_SHIFT;
> - if ((bytenr + sizeof(*disk_super) - 1) >> PAGE_CACHE_SHIFT != index)
> - goto error_bdev_put;

Apparently btrfs_get_bdev_and_sb() lacks the above two checks, otherwise looks 
good.

thanks,
-liubo

> -
> - /* pull in the page with our super */
> - page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
> -index, GFP_NOFS);
> -
> - if (IS_ERR_OR_NULL(page))
> - goto error_bdev_put;
> -
> - p = kmap(page);
> -
> - /* align our pointer to the offset of the super block */
> - disk_super = p + (bytenr & ~PAGE_CACHE_MASK);
> -
> - if (btrfs_super_bytenr(disk_super) != bytenr ||
> - btrfs_super_magic(disk_super) != BTRFS_MAGIC)
> - goto error_unmap;
> + disk_super = (struct btrfs_super_block *) bh->b_data;
>  
>   devid = btrfs_stack_device_id(&disk_super->dev_item);
>   transid = btrfs_super_generation(disk_super);
> @@ -1125,13 +1090,9 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>   printk(KERN_CONT "devid %llu transid %llu %s\n", devid, transid, path);
>   }
>  
> -
> -error_unmap:
> - kunmap(page);
> - page_cache_release(page);
> -
> -error_bdev_put:
> + brelse(bh);
>   blkdev_put(bdev, flags);
> +
>  error:
>   mutex_unlock(&uuid_mutex);
>   return ret;
> -- 
> 2.0.0.257.g75cc6c6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] Revert "btrfs: allow mounting btrfs subvolumes with different ro/rw options"

2014-07-07 Thread Goffredo Baroncelli
On 07/08/2014 04:43 AM, Duncan wrote:
> The remaining problem to deal with is that if say the root subvol (id=5) 
> is mounted rw,subvolmode=rw, while a subvolume below it is mounted 
> subvolmode=ro, then what happens if someone tries to make an edit in the 
> portion of the filesystem visible in the subvolume, but from the parent, 
> id=5/root in this case?  Obviously if that modification is allowed from 
> the parent, it'll change what's visible in the child subvolume as well, 
> which would be rather unexpected.

The ro/rw status is a subvolume flag.
So if a subvolume is marked rw (or ro), it is writable (or not writable) in
all the mounts.
This flag is not inheritable.

What could be strange is the following:

  # mount -o subvolid=5,rw /dev/sda1 /mnt/btrfs-root
  # btrfs subvol create /mnt/btrfs-root/subvolname/
then
  # touch /mnt/btrfs-root/subvolname/touch-file

succeeds; but

  # mount -o subvolid=5,rw /dev/sda1 /mnt/btrfs-root
  # btrfs subvol create /mnt/btrfs-root/subvolname/
  # mount -o subvol=subvolname,ro /dev/sda1 /mnt/btrfs-subvol
then
  # touch /mnt/btrfs-root/subvolname/touch-file2

fails.




-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH RFC] btrfs: code optimize use btrfs_get_bdev_and_sb() at btrfs_scan_one_device

2014-07-07 Thread Miao Xie
On Tue, 8 Jul 2014 02:38:37 +0800, Anand Jain wrote:
> (for review comments pls).
> 
> btrfs_scan_one_device() needs SB, instead of doing it from scratch could
> use btrfs_get_bdev_and_sb()
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/volumes.c | 51 ++-
>  1 file changed, 6 insertions(+), 45 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c166355..94e6131 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1053,14 +1053,11 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>  {
>   struct btrfs_super_block *disk_super;
>   struct block_device *bdev;
> - struct page *page;
> - void *p;
>   int ret = -EINVAL;
>   u64 devid;
>   u64 transid;
>   u64 total_devices;
> - u64 bytenr;
> - pgoff_t index;
> + struct buffer_head *bh;
>  
>   /*
>* we would like to check all the supers, but that would make
> @@ -1068,44 +1065,12 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>* So, we need to add a special mount option to scan for
>* later supers, using BTRFS_SUPER_MIRROR_MAX instead
>*/
> - bytenr = btrfs_sb_offset(0);
>   mutex_lock(&uuid_mutex);
>  
> - bdev = blkdev_get_by_path(path, flags, holder);
> -
> - if (IS_ERR(bdev)) {
> - ret = PTR_ERR(bdev);
> + ret = btrfs_get_bdev_and_sb(path, flags, holder, 0, &bdev, &bh);
> + if (ret)
>   goto error;
> - }
> -
> - /* make sure our super fits in the device */
> - if (bytenr + PAGE_CACHE_SIZE >= i_size_read(bdev->bd_inode))
> - goto error_bdev_put;

I think moving this check into btrfs_get_bdev_and_sb() is better. The rest is OK.
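
(Roughly, the moved check might look like this inside btrfs_get_bdev_and_sb()
-- a sketch only; the label name and exact placement are assumptions,
mirroring the check being removed above:

	bytenr = btrfs_sb_offset(0);
	/* make sure the first super fits in the device */
	if (bytenr + PAGE_CACHE_SIZE >= i_size_read((*bdev)->bd_inode)) {
		ret = -EINVAL;
		goto error_bdev_put;
	}
)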

Thanks
Miao

> -
> - /* make sure our super fits in the page */
> - if (sizeof(*disk_super) > PAGE_CACHE_SIZE)
> - goto error_bdev_put;
> -
> - /* make sure our super doesn't straddle pages on disk */
> - index = bytenr >> PAGE_CACHE_SHIFT;
> - if ((bytenr + sizeof(*disk_super) - 1) >> PAGE_CACHE_SHIFT != index)
> - goto error_bdev_put;
> -
> - /* pull in the page with our super */
> - page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
> -index, GFP_NOFS);
> -
> - if (IS_ERR_OR_NULL(page))
> - goto error_bdev_put;
> -
> - p = kmap(page);
> -
> - /* align our pointer to the offset of the super block */
> - disk_super = p + (bytenr & ~PAGE_CACHE_MASK);
> -
> - if (btrfs_super_bytenr(disk_super) != bytenr ||
> - btrfs_super_magic(disk_super) != BTRFS_MAGIC)
> - goto error_unmap;
> + disk_super = (struct btrfs_super_block *) bh->b_data;
>  
>   devid = btrfs_stack_device_id(&disk_super->dev_item);
>   transid = btrfs_super_generation(disk_super);
> @@ -1125,13 +1090,9 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>   printk(KERN_CONT "devid %llu transid %llu %s\n", devid, transid, path);
>   }
>  
> -
> -error_unmap:
> - kunmap(page);
> - page_cache_release(page);
> -
> -error_bdev_put:
> + brelse(bh);
>   blkdev_put(bdev, flags);
> +
>  error:
>   mutex_unlock(&uuid_mutex);
>   return ret;
> 



Re: [RFC PATCH] Revert "btrfs: allow mounting btrfs subvolumes with different ro/rw options"

2014-07-07 Thread Duncan
Goffredo Baroncelli posted on Mon, 07 Jul 2014 19:37:53 +0200 as
excerpted:

> For "mounted RO" I mean the VFS flag, the "one" passed via the mount
> command. I say "one" as 1, because I am convinced that it has to act
> globally,
> e.g. on the whole filesystem; the flag should be set at the first mount,
> then it can be changed (only ?) issuing a "mount -o remount,rw/ro"
[...]
> So for each filesystem, there is a "global" flag ro/rw which acts on
> the whole filesystem. Clear and simple.
> 
> Step 2: a more fine grained control of the subvolumes.
> We have already the capability to make a subvolume read-only/read-write
> doing
> 
># btrfs property set -t s /path/to/subvolume ro true
> 
> or
> 
># btrfs property set -t s /path/to/subvolume ro false
> 
> My idea is to use this flag. It could be done at the mount time for
> example:
> 
>   # mount -o subvolmode=ro,subvol=subvolname /dev/sda1 /
> 
> (this example doesn't work, it is only my idea)
> 
> So:
> - we should not add further code
> - the semantic is simple
> - the property is linked to the subvolume in an understandable way
> 
> We should only add the subvolmode=ro option to the mount command.
> 
> Further discussion needs to investigate the following cases:
> - if the filesystem is mounted as ro (mount -o ro), should mounting a
> subvolume rw ( mount -o subvolmode=rw...) raise an error ?
> (IMHO yes)
> - if the filesystem is mounted as ro (mount -o ro), should mounting
> the filesystem a 2nd time rw ( mount -o rw...) raise an error ?
> (IMHO yes)
> - if a subvolume is mounted rw (or ro), should mounting the same subvolume
> a 2nd time as ro (or rw) raise an error ?
> (IMHO yes)

Makes sense.

Assuming I'm following you correctly, then, no subvolumes could be rw if 
the filesystem/vfs flag was set ro.

Which would mean that in ordered to mount any particular subvolume rw, 
the whole filesystem would have to be rw.

Extending now:

For simplicity and backward compatibility, if subvolmode isn't set, it 
corresponds to the whole-fs/vfs mode.  That way, setting mount -o ro,... 
(or -o rw,...) with the first mount would naturally propagate to all 
subsequent subvolume mounts, unless of course (1) all subvolumes and the 
filesystem itself are umounted, after which a new mount would be the 
first one again, or (2) a mount -o remount,... is done that changes the 
whole-fs mode.

Further, if it happened that one wanted the first subvolume mounted to be 
ro, but the filesystem as a whole rw so that other subvolumes could be 
mounted rw, the following would accomplish that:

mount -o rw,subvolmode=ro

That way, the subvol would be ro as desired, but the filesystem as a 
whole would be rw, so other subvolumes could be successfully mounted rw.

I like the concept. =:^)

The remaining problem to deal with is that if say the root subvol (id=5) 
is mounted rw,subvolmode=rw, while a subvolume below it is mounted 
subvolmode=ro, then what happens if someone tries to make an edit in the 
portion of the filesystem visible in the subvolume, but from the parent, 
id=5/root in this case?  Obviously if that modification is allowed from 
the parent, it'll change what's visible in the child subvolume as well, 
which would be rather unexpected.

I'd suggest that the snapshotting border rule should apply to writes as 
well.  Snapshots stop at subvolume borders, and writes should as well.  
Attempting to write in a child subvolume should error out -- child 
subvolumes are not part of a parent snapshot and shouldn't be writable 
from the parent subvolume, either.  Child-subvolume content should be 
read-only because it's beyond the subvolume border.

That would seem to be the safest.  Altho I believe it's a change from 
current behavior, where it's possible to write into any subvolume visible 
from the parent (that is, not covered by an over-mount, perhaps even of 
the same subvolume that would otherwise be visible in the same location 
from the parent subvolume), provided the parent is writable.

Regardless, my biggest take-away from the discussion so far is that I'm 
glad I decided to go with entirely separate filesystems, each on their 
own partitions, so my ro vs writable mounts do exactly what I expect them 
to do without me having to worry or think about it too much!  That wasn't 
the reason I did it -- I did it because I didn't want all my data eggs in 
the same whole-filesystem basket such that if a filesystem was damaged, 
the damage was compartmentalized -- but now that see all the subvolume rw/
ro implications discussed in this thread, I'm VERY glad I personally 
don't have to worry about it, and it all simply "just works" for me, 
because each filesystem is independent of the others, not simply a 
subvolume!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Re: [PATCH V2 7/9] btrfs: fix null pointer dereference in clone_fs_devices when name is null

2014-07-07 Thread Miao Xie
On Mon, 7 Jul 2014 17:56:13 +0800, Anand Jain wrote:
> 
> 
> On 07/07/2014 12:22, Miao Xie wrote:
>> On Mon, 7 Jul 2014 12:04:09 +0800, Anand Jain wrote:
 When one of the device paths is missing, the btrfs_device name is null. So
 this patch will check for that.

 stack:
 BUG: unable to handle kernel NULL pointer dereference at 0010
 IP: [] strlen+0x0/0x30
 [] ? clone_fs_devices+0xaa/0x160 [btrfs]
 [] btrfs_init_new_device+0x317/0xca0 [btrfs]
 [] ? __kmalloc_track_caller+0x15a/0x1a0
 [] btrfs_ioctl+0xaa3/0x2860 [btrfs]
 [] ? handle_mm_fault+0x48c/0x9c0
 [] ? __blkdev_put+0x171/0x180
 [] ? __do_page_fault+0x4ac/0x590
 [] ? blkdev_put+0x106/0x110
 [] ? mntput+0x35/0x40
 [] do_vfs_ioctl+0x460/0x4a0
 [] ? fput+0xe/0x10
 [] ? task_work_run+0xb3/0xd0
 [] SyS_ioctl+0x57/0x90
 [] ? do_page_fault+0xe/0x10
 [] system_call_fastpath+0x16/0x1b

 reproducer:
 mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
 btrfstune -S 1 /dev/sdg1
 modprobe -r btrfs && modprobe btrfs
 mount -o degraded /dev/sdg1 /btrfs
 btrfs dev add /dev/sdg3 /btrfs

 Signed-off-by: Anand Jain 
 Signed-off-by: Miao Xie 
 ---
 Changelog v1->v2:
 - Fix the problem that we forgot to set the missing flag for the cloned 
 device
 ---
fs/btrfs/volumes.c | 25 -
1 file changed, 16 insertions(+), 9 deletions(-)

 diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
 index 1891541..4731bd6 100644
 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -598,16 +598,23 @@ static struct btrfs_fs_devices 
 *clone_fs_devices(struct btrfs_fs_devices *orig)
if (IS_ERR(device))
goto error;

 -/*
 - * This is ok to do without rcu read locked because we hold the
 - * uuid mutex so nothing we touch in here is going to disappear.
 - */
 -name = rcu_string_strdup(orig_dev->name->str, GFP_NOFS);
 -if (!name) {
 -kfree(device);
 -goto error;
 +if (orig_dev->missing) {
 +device->missing = 1;
 +fs_devices->missing_devices++;
>>>
>>>   as mentioned in some places we just check name (for missing device)
>>>   and  don't set the missing flag so it better to ..
>>>
>>>   if (orig_dev->missing || !orig_dev->name) {
>>>  device->missing = 1;
>>>  fs_devices->missing_devices++;
>>
>> I don't think we need to check the name pointer here, because only a
>> missing device doesn't have its own name; otherwise there is something
>> wrong in the code, so I added an assert in the else branch. Am I right?
> 
> In a few critical code paths, like the one below (and I guess in the
> chunk/stripe functions as well), we don't make use of the missing flag,
> but rather ->name.
> 
> -
> btrfsic_process_superblock
> ::
> if (!device->bdev || !device->name)
> continue;
> -
> 
>  But here, even without the !orig_dev->name check, it is good enough.

Right.
According to the code, only a missing device doesn't have its own name;
that is, we can check whether a device is missing either by the missing flag
or by its name pointer. Maybe we can remove the missing flag and check the
device just by its name pointer. (In order to make the code more readable,
maybe we need to introduce a function to wrap the missing-device check.)
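
(A minimal sketch of such a wrapper -- the name is hypothetical, and it
assumes only the ->missing and ->name fields already used in the patch:

	/* hypothetical helper, not in the tree */
	static inline bool btrfs_dev_is_missing(struct btrfs_device *device)
	{
		/* only a missing device has no name */
		return device->missing || !device->name;
	}
)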

Thanks
Miao

> 
> Thanks, Anand
> 
> 
 +} else {
 +ASSERT(orig_dev->name);
 +/*
 + * This is ok to do without rcu read locked because
 + * we hold the uuid mutex so nothing we touch in here
 + * is going to disappear.
 + */
 +name = rcu_string_strdup(orig_dev->name->str, GFP_NOFS);
 +if (!name) {
 +kfree(device);
 +goto error;
 +}
 +rcu_assign_pointer(device->name, name);
}
 -rcu_assign_pointer(device->name, name);

list_add(&device->dev_list, &fs_devices->devices);
device->fs_devices = fs_devices;

>>>
>>> Thanks, Anand



Re: [PATCH 2/2] btrfs-progs: Add mount point check for 'btrfs fi df' command

2014-07-07 Thread Qu Wenruo


-------- Original Message --------
Subject: Re: [PATCH 2/2] btrfs-progs: Add mount point check for 'btrfs 
fi df' command

From: Vikram Goyal 
To: linux-btrfs@vger.kernel.org
Date: 2014-07-07 17:51

On Fri, Jul 04, 2014 at 03:52:26PM +0200, David Sterba wrote:

On Fri, Jul 04, 2014 at 04:38:49PM +0800, Qu Wenruo wrote:

'btrfs fi df' command is currently able to be executed on any file/dir
inside btrfs since it uses btrfs ioctl to get disk usage info.

However it is somewhat confusing for some end users since normally such
command should only be executed on a mount point.


I disagree here, it's much more convenient to run 'fi df' anywhere and
get the output. The system 'df' command works the same way.


Just to clarify, in case my earlier mail did not convey the idea
properly.

The basic difference between traditional df & btrfs fi df is that
traditional df does not error out when no arg is given & outputs all the
mounted FSes with their mount points. So to be consistent, btrfs fi df
should output all BTRFSes with mount points if no arg is given.

Btrfs fi df insists on an arg but does not clarify in its output whether
the given arg is a path inside of a mount point or the mount point itself,
which could become transparent if the mount point were also shown in the
output.

IMO this is much better.

Cc David.
What about this idea? No extra warning, but output the mount point?
Since it calls find_mount_root(), which checks whether the mount point is
btrfs, it can provide a more meaningful error message than the original
"ERROR: couldn't get space info - Inappropriate ioctl for device".


Thanks,
Qu


This is just a request & a pointer to an oversight/anomaly, but if the
developers do not feel in resonance with it right now then I just wish
that they keep it in mind, think about it & remove this confusion caused
by btrfs fi df as, when & how they see fit.



The 'fi df' command itself is not that user friendly and the numbers
need further interpretation. I'm using it heavily during debugging, and
restricting it to the mountpoint seems too artificial; the tool can cope
with that.

The 'fi usage' is supposed to give the user-friendly overview, but the
patchset is stuck because I found the numbers wrong or misleading under
some circumstances.

I'll reread the thread that motivated this patch to see if there's
something to address.


Thanks





Re: mount time of multi-disk arrays

2014-07-07 Thread Konstantinos Skarlatos

On 7/7/2014 5:24 μμ, André-Sebastian Liebe wrote:

On 07/07/2014 03:54 PM, Konstantinos Skarlatos wrote:

On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:

Hello List,

can anyone tell me how much time is acceptable and assumable for a
multi-disk btrfs array with classical hard disk drives to mount?

I'm having a bit of trouble with my current systemd setup, because it
couldn't mount my btrfs raid anymore after adding the 5th drive. With
the 4 drive setup it failed to mount once in a few times. Now it fails
everytime because the default timeout of 1m 30s is reached and mount is
aborted.
My last 10 manual mounts took between 1m57s and 2m12s to finish.

I have the exact same problem, and have to manually mount my large
multi-disk btrfs filesystems, so I would be interested in a solution
as well.

Hi Konstantinos, you can work around this by manually creating a systemd
mount unit.

- First review the autogenerated systemd mount unit (systemctl show
.mount). You can get the unit name by issuing a
'systemctl' and looking for your failed mount.
- Then you have to take the needed values (After, Before, Conflicts,
RequiresMountsFor, Where, What, Options, Type, Wantedby) and put them
into an new systemd mount unit file (possibly under
/usr/lib/systemd/system/.mount ).
- Now just add the TimeoutSec with a large enough value below [Mount].
- If you later want to automount your raid, add the WantedBy under [Install]
- now issue a 'systemctl daemon-reload' and look for error messages in
syslog.
- If there are no errors you could enable your manual mount entry by
'systemctl enable .mount' and safely comment out your
old fstab entry (systemd no longer generates autogenerated units).

-- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< ---
[Unit]
Description=Mount /data/pool0
After=dev-disk-by\x2duuid-066141c6\x2d16ca\x2d4a30\x2db55c\x2de606b90ad0fb.device
systemd-journald.socket local-fs-pre.target system.slice -.mount
Before=umount.target
Conflicts=umount.target
RequiresMountsFor=/data
/dev/disk/by-uuid/066141c6-16ca-4a30-b55c-e606b90ad0fb

[Mount]
Where=/data/pool0
What=/dev/disk/by-uuid/066141c6-16ca-4a30-b55c-e606b90ad0fb
Options=rw,relatime,skip_balance,compress
Type=btrfs
TimeoutSec=3min

[Install]
WantedBy=dev-disk-by\x2duuid-066141c6\x2d16ca\x2d4a30\x2db55c\x2de606b90ad0fb.device
-- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< ---


Hi André,
This unit file works for me, thank you for creating it! Can somebody put 
it on the wiki?

My hardware setup contains a
- Intel Core i7 4770
- Kernel 3.15.2-1-ARCH
- 32GB RAM
- dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
- dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)

Thanks in advance

André-Sebastian Liebe
--


# btrfs fi sh
Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
  Total devices 5 FS bytes used 14.21TiB
  devid1 size 3.64TiB used 2.86TiB path /dev/sdd
  devid2 size 3.64TiB used 2.86TiB path /dev/sdc
  devid3 size 3.64TiB used 2.86TiB path /dev/sdf
  devid4 size 3.64TiB used 2.86TiB path /dev/sde
  devid5 size 3.64TiB used 2.88TiB path /dev/sdb

Btrfs v3.14.2-dirty

# btrfs fi df /data/pool0/
Data, single: total=14.28TiB, used=14.19TiB
System, RAID1: total=8.00MiB, used=1.54MiB
Metadata, RAID1: total=26.00GiB, used=20.20GiB
unknown, single: total=512.00MiB, used=0.00



--
Konstantinos Skarlatos

--
André-Sebastian Liebe




--
Konstantinos Skarlatos



Re: mount time of multi-disk arrays

2014-07-07 Thread Konstantinos Skarlatos

On 7/7/2014 6:48 μμ, Duncan wrote:

Konstantinos Skarlatos posted on Mon, 07 Jul 2014 16:54:05 +0300 as
excerpted:


On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:

can anyone tell me how much time is acceptable and assumable for a
multi-disk btrfs array with classical hard disk drives to mount?

I'm having a bit of trouble with my current systemd setup, because it
couldn't mount my btrfs raid anymore after adding the 5th drive. With
the 4 drive setup it failed to mount once in a few times. Now it fails
everytime because the default timeout of 1m 30s is reached and mount is
aborted.
My last 10 manual mounts took between 1m57s and 2m12s to finish.

I have the exact same problem, and have to manually mount my large
multi-disk btrfs filesystems, so I would be interested in a solution as
well.

I don't have a direct answer, as my btrfs devices are all SSD, but...

a) Btrfs, like some other filesystems, is designed not to need a
pre-mount (or pre-rw-mount) fsck, because it does what /should/ be a
quick-scan at mount-time.  However, that isn't always as quick as it
might be for a number of reasons:

a1) Btrfs is still a relatively immature filesystem and certain
operations are not yet optimized.  In particular, multi-device btrfs
operations tend to still be using a first-working-implementation type of
algorithm instead of a well optimized for parallel operation algorithm,
and thus often serialize access to multiple devices where a more
optimized algorithm would parallelize operations across multiple devices
at the same time.  That will come, but it's not there yet.

a2) Certain operations such as orphan cleanup ("orphans" are files that
were deleted while they were in use and thus weren't fully deleted at the
time; if they were still in use at unmount (remount-read-only), cleanup
is done at mount-time) can delay mount as well.

a3) Inode_cache mount option:  Don't use this unless you can explain
exactly WHY you are using it, preferably backed up with benchmark
numbers, etc.  It's useful only on 32-bit, generally high-file-activity
server systems and has general-case problems, including long mount times
and possible overflow issues that make it inappropriate for normal use.
Unfortunately there's a lot of people out there using it that shouldn't
be, and I even saw it listed on at least one distro (not mine!) wiki. =:^(

a4) The space_cache mount option OTOH *IS* appropriate for normal use
(and is in fact enabled by default these days), but particularly in
improper shutdown cases can require rebuilding at mount time -- altho
this should happen /after/ mount, the system will just be busy for some
minutes, until the space-cache is rebuilt.  But the IO from a space_cache
rebuild on one filesystem could slow down the mounting of filesystems
that mount after it, as well as the boot-time launching of other post-
mount launched services.

If you're seeing the time go up dramatically with the addition of more
filesystem devices, however, and you do /not/ have inode_cache active,
I'd guess it's mainly the not-yet-optimized multi-device operations.


b) As with any systemd launched unit, however, there's systemd
configuration mechanisms for working around specific unit issues,
including timeout issues.  Of course most systems continue to use fstab
and let systemd auto-generate the mount units, and in fact that is
recommended, but either with fstab or directly created mount units,
there's a timeout configuration option that can be set.

b1) The general systemd *.mount unit [Mount] section option appears to be
TimeoutSec=.  As is usual with systemd times, the default is seconds, or
pass the unit(s, like "5min 20s").

b2) I don't see it /specifically/ stated, but with a bit of reading
between the lines, the corresponding fstab option appears to be either
x-systemd.timeoutsec= or x-systemd.TimeoutSec= (IOW I'm not sure of the
case).  You may also want to try x-systemd.device-timeout=, which /is/
specifically mentioned, altho that appears to be specifically the timeout
for the device to appear, NOT for the filesystem to mount after it does.

b3) See the systemd.mount (5) and systemd-fstab-generator (8) manpages
for more, that being what the above is based on.
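
(For example, an fstab entry along these lines -- untested; per b2 the
exact key may need experimentation, and x-systemd.device-timeout= covers
device appearance rather than the mount itself:

  UUID=066141c6-16ca-4a30-b55c-e606b90ad0fb /data/pool0 btrfs \
  rw,relatime,x-systemd.device-timeout=5min 0 0
)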

Thanks for your detailed answer. A mount unit with a larger timeout
works fine; maybe we should tell distro maintainers to raise the limit for
btrfs to 5 minutes or so?


In my experience, mount time definitely grows as the filesystem grows
older, and times out after the snapshot count gets to more than 500-1000. I
guess that's something that can be optimized in the future, but I believe
stability is a much more urgent need now...




So it might take a bit of experimentation to find the exact command, but
based on the above anyway, it /should/ be pretty easy to tell systemd to
wait a bit longer for that filesystem.

When you find the right invocation, please reply with it here, as I'm
sure there's others who will benefit as well.  FWIW, I'm still on
reiserfs for my spinning 

Re: [v3.10.y][v3.11.y][v3.12.y][v3.13.y][v3.14.y][PATCH 1/1][V2] ALSA: usb-audio: Prevent printk ratelimiting from spamming kernel log while DEBUG not defined

2014-07-07 Thread Greg KH
On Sat, Jun 21, 2014 at 12:48:27PM -0700, Greg KH wrote:
> On Sat, Jun 21, 2014 at 01:05:53PM +0100, Ben Hutchings wrote:
> > On Fri, 2014-06-20 at 14:21 -0400, Joseph Salisbury wrote:
> > [...]
> > > I looked at this some more.   It seems like my v2 backport may be the
> > > most suitable for the releases mentioned in the subject line, but I'd
> > > like to get additional feedback.
> > > 
> > > The lines added by commit a5065eb just get removed by commit b7a77235. 
> > > Also, if I apply commit a5065eb, it will also require a backport to pull
> > > in just a piece of code(Remove snd_printk() and add dev_dbg()) from
> > > another prior commit(0ba41d9).  No backport would be needed at all if I
> > > cherry-pick 0ba41d9, but that commit seems to have too may changes for a
> > > stable release.
> > 
> > Keep the changes squashed together if you like, but do include both
> > commit hashes and commit messages.
> 
> No, I don't want to see "squashed together" patches, please keep them as
> close to the original patch as possible.  It saves time in the long run,
> trust me...

And since no one did this work for me, I had to do it myself...

{grumble}


[PATCH] Btrfs: __btrfs_mod_ref should always use no_quota

2014-07-07 Thread Mark Fasheh
From: Josef Bacik 

Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
understand how snapshot delete was using it and assumed that we needed the
quota operations there.  With Mark's work this has turned out to be not the
case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
argument and make __btrfs_mod_ref call its process function with no_quota set
always.  Thanks,

Signed-off-by: Josef Bacik 
Signed-off-by: Mark Fasheh 
---
 fs/btrfs/ctree.c   | 20 ++--
 fs/btrfs/ctree.h   |  4 ++--
 fs/btrfs/extent-tree.c | 24 +++-
 3 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d99d965..d9e0ce0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -280,9 +280,9 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
 
WARN_ON(btrfs_header_generation(buf) > trans->transid);
if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
-   ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+   ret = btrfs_inc_ref(trans, root, cow, 1);
else
-   ret = btrfs_inc_ref(trans, root, cow, 0, 1);
+   ret = btrfs_inc_ref(trans, root, cow, 0);
 
if (ret)
return ret;
@@ -1035,14 +1035,14 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
if ((owner == root->root_key.objectid ||
 root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) &&
!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)) {
-   ret = btrfs_inc_ref(trans, root, buf, 1, 1);
+   ret = btrfs_inc_ref(trans, root, buf, 1);
BUG_ON(ret); /* -ENOMEM */
 
if (root->root_key.objectid ==
BTRFS_TREE_RELOC_OBJECTID) {
-   ret = btrfs_dec_ref(trans, root, buf, 0, 1);
+   ret = btrfs_dec_ref(trans, root, buf, 0);
BUG_ON(ret); /* -ENOMEM */
-   ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+   ret = btrfs_inc_ref(trans, root, cow, 1);
BUG_ON(ret); /* -ENOMEM */
}
new_flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
@@ -1050,9 +1050,9 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 
if (root->root_key.objectid ==
BTRFS_TREE_RELOC_OBJECTID)
-   ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+   ret = btrfs_inc_ref(trans, root, cow, 1);
else
-   ret = btrfs_inc_ref(trans, root, cow, 0, 1);
+   ret = btrfs_inc_ref(trans, root, cow, 0);
BUG_ON(ret); /* -ENOMEM */
}
if (new_flags != 0) {
@@ -1069,11 +1069,11 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
if (flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
if (root->root_key.objectid ==
BTRFS_TREE_RELOC_OBJECTID)
-   ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+   ret = btrfs_inc_ref(trans, root, cow, 1);
else
-   ret = btrfs_inc_ref(trans, root, cow, 0, 1);
+   ret = btrfs_inc_ref(trans, root, cow, 0);
BUG_ON(ret); /* -ENOMEM */
-   ret = btrfs_dec_ref(trans, root, buf, 1, 1);
+   ret = btrfs_dec_ref(trans, root, buf, 1);
BUG_ON(ret); /* -ENOMEM */
}
clean_tree_block(trans, root, buf);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4896d7a..56f280f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3307,9 +3307,9 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
 u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 struct btrfs_key *ins, int is_data);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
- struct extent_buffer *buf, int full_backref, int no_quota);
+ struct extent_buffer *buf, int full_backref);
 int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
- struct extent_buffer *buf, int full_backref, int no_quota);
+ struct extent_buffer *buf, int full_backref);
 int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 flags,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tr

btrfs: add trace for qgroup accounting

2014-07-07 Thread Mark Fasheh
We want this to debug qgroup changes on live systems.

Signed-off-by: Mark Fasheh 
Reviewed-by: Josef Bacik 
---
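(Once applied, these fire like any other btrfs tracepoint; illustrative
commands, assuming the standard tracefs layout:

  # echo 1 > /sys/kernel/debug/tracing/events/btrfs/btrfs_qgroup_account/enable
  # cat /sys/kernel/debug/tracing/trace_pipe
)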
 fs/btrfs/qgroup.c|  3 +++
 fs/btrfs/super.c |  1 +
 include/trace/events/btrfs.h | 56 
 3 files changed, 60 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index cf5aead..a9f0f05 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1290,6 +1290,7 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
oper->seq = atomic_inc_return(&fs_info->qgroup_op_seq);
INIT_LIST_HEAD(&oper->elem.list);
oper->elem.seq = 0;
+   trace_btrfs_qgroup_record_ref(oper);
ret = insert_qgroup_oper(fs_info, oper);
if (ret) {
/* Shouldn't happen so have an assert for developers */
@@ -1909,6 +1910,8 @@ static int btrfs_qgroup_account(struct btrfs_trans_handle *trans,
 
ASSERT(is_fstree(oper->ref_root));
 
+   trace_btrfs_qgroup_account(oper);
+
switch (oper->type) {
case BTRFS_QGROUP_OPER_ADD_EXCL:
case BTRFS_QGROUP_OPER_SUB_EXCL:
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4662d92..ca7836c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -60,6 +60,7 @@
 #include "backref.h"
 #include "tests/btrfs-tests.h"
 
+#include "qgroup.h"
 #define CREATE_TRACE_POINTS
 #include 
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 4ee4e30..b8774b3 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -23,6 +23,7 @@ struct map_lookup;
 struct extent_buffer;
 struct btrfs_work;
 struct __btrfs_workqueue;
+struct btrfs_qgroup_operation;
 
 #define show_ref_type(type)\
__print_symbolic(type,  \
@@ -1119,6 +1120,61 @@ DEFINE_EVENT(btrfs__workqueue_done, btrfs_workqueue_destroy,
TP_ARGS(wq)
 );
 
+#define show_oper_type(type)   \
+   __print_symbolic(type,  \
+   { BTRFS_QGROUP_OPER_ADD_EXCL,   "OPER_ADD_EXCL" },  \
+   { BTRFS_QGROUP_OPER_ADD_SHARED, "OPER_ADD_SHARED" },\
+   { BTRFS_QGROUP_OPER_SUB_EXCL,   "OPER_SUB_EXCL" },  \
+   { BTRFS_QGROUP_OPER_SUB_SHARED, "OPER_SUB_SHARED" })
+
+DECLARE_EVENT_CLASS(btrfs_qgroup_oper,
+
+   TP_PROTO(struct btrfs_qgroup_operation *oper),
+
+   TP_ARGS(oper),
+
+   TP_STRUCT__entry(
+   __field(u64,  ref_root  )
+   __field(u64,  bytenr)
+   __field(u64,  num_bytes )
+   __field(u64,  seq   )
+   __field(int,  type  )
+   __field(u64,  elem_seq  )
+   ),
+
+   TP_fast_assign(
+   __entry->ref_root   = oper->ref_root;
+   __entry->bytenr = oper->bytenr,
+   __entry->num_bytes  = oper->num_bytes;
+   __entry->seq= oper->seq;
+   __entry->type   = oper->type;
+   __entry->elem_seq   = oper->elem.seq;
+   ),
+
+   TP_printk("ref_root = %llu, bytenr = %llu, num_bytes = %llu, "
+ "seq = %llu, elem.seq = %llu, type = %s",
+ (unsigned long long)__entry->ref_root,
+ (unsigned long long)__entry->bytenr,
+ (unsigned long long)__entry->num_bytes,
+ (unsigned long long)__entry->seq,
+ (unsigned long long)__entry->elem_seq,
+ show_oper_type(__entry->type))
+);
+
+DEFINE_EVENT(btrfs_qgroup_oper, btrfs_qgroup_account,
+
+   TP_PROTO(struct btrfs_qgroup_operation *oper),
+
+   TP_ARGS(oper)
+);
+
+DEFINE_EVENT(btrfs_qgroup_oper, btrfs_qgroup_record_ref,
+
+   TP_PROTO(struct btrfs_qgroup_operation *oper),
+
+   TP_ARGS(oper)
+);
+
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
1.8.4.5



[PATCH 0/3] btrfs: qgroup fixes for btrfs_drop_snapshot V3

2014-07-07 Thread Mark Fasheh
Hi, the following patches try to fix a long outstanding issue with qgroups
and snapshot deletion. The core problem is that btrfs_drop_snapshot will
skip shared extents during its tree walk. This results in an inconsistent
qgroup state once the drop is processed.

The first patch adds some tracing which I found very useful in debugging
qgroup operations. The second patch is an actual fix to the problem. A third
patch, from Josef is also added. We need this because it fixes at least one
set of inconsistencies qgroups can get to via drop_snapshot.

With this version of the patch series, I can no longer reproduce
qgroup inconsistencies via drop_snapshot on my test disks.

Changes from last patch set:

- search on bytenr and root, but not seq in btrfs_record_ref when
  we're looking for existing qgroup operations.

Changes before that (V1-V2):

- remove extra extent_buffer_uptodate call from account_shared_subtree()

- catch return values for the accounting calls now and do the right thing
  (log an error and tell the user to rescan)

- remove the loop on roots in qgroup_subtree_accounting and just use the
  nnodes member to make our first decision.

- Don't queue up the subtree root for a change (the code in drop_snapshot
  handles qgroup updates for this block).

- only walk subtrees if we're actually in DROP_REFERENCE stage and we're
  going to call free_extent

- account leaf items for level zero blocks that we are dropping in
  walk_up_proc


General qgroups TODO:

- We need an xfstest for the drop_snapshot case, otherwise I'm
  concerned that we can easily regress from bugs introduced via
  seemingly unrelated patches. This stuff can be fragile.
   - I already have a script that creates and removes a level 1 tree to
 introduce an inconsistency. I think adapting that is probably a good
 first step. The script can be found at:

 http://zeniv.linux.org.uk/~mfasheh/create-btrfs-trees.sh

 Please don't make fun of my poor shell scripting skills  :)

- qgroup items are not deleted after drop_snapshot. They stay orphaned, on
  disk, often with nonzero values in their count fields. This is something
  for another patch. Josef and I have some ideas for how to deal with this:
   - Just zero them out at the end of drop_snapshot (maybe in the future we
 could actually then delete them from disk?)
   - update btrfs_subtree_accounting() to remove bytes from the
 being-deleted qgroups so they wind up as zero on disk (this is
 preferable but might not be practical)

- we need at least a rescan to be kicked off when adding parent qgroups.
  otherwise, the newly added groups start with the wrong information. Quite
  possible the rescan itself might need to be updated (I haven't tested this
  enough).

- qgroup hierarchies in general don't seem quite implemented yet. Once we
  fix the previous items the code to update their counts for them will
  probably need some love.

Please review, thanks. Diffstat follows,
--Mark

 fs/btrfs/ctree.c |   20 +--
 fs/btrfs/ctree.h |4 
 fs/btrfs/extent-tree.c   |  285 +--
 fs/btrfs/qgroup.c|  168 +
 fs/btrfs/qgroup.h|1 
 fs/btrfs/super.c |1 
 include/trace/events/btrfs.h |   57 
 7 files changed, 511 insertions(+), 25 deletions(-)


btrfs: qgroup: account shared subtrees during snapshot delete

2014-07-07 Thread Mark Fasheh
During its tree walk, btrfs_drop_snapshot() will skip any shared
subtrees it encounters. This is incorrect when we have qgroups
turned on as those subtrees need to have their contents
accounted. In particular, the case we're concerned with is when
removing our snapshot root leaves the subtree with only one root
reference.

In those cases we need to find the last remaining root and add
each extent in the subtree to the corresponding qgroup exclusive
counts.

This patch implements the shared subtree walk and a new qgroup
operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
this type is encountered during qgroup accounting, we search for
any root references to that extent and in the case that we find
only one reference left, we go ahead and do the math on its
exclusive counts.

Signed-off-by: Mark Fasheh 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c   | 261 +++
 fs/btrfs/qgroup.c| 165 +++
 fs/btrfs/qgroup.h|   1 +
 include/trace/events/btrfs.h |   3 +-
 4 files changed, 429 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 46f39bf..3f43e9a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7324,6 +7324,220 @@ reada:
wc->reada_slot = slot;
 }
 
+static int account_leaf_items(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct extent_buffer *eb)
+{
+   int nr = btrfs_header_nritems(eb);
+   int i, extent_type, ret;
+   struct btrfs_key key;
+   struct btrfs_file_extent_item *fi;
+   u64 bytenr, num_bytes;
+
+   for (i = 0; i < nr; i++) {
+   btrfs_item_key_to_cpu(eb, &key, i);
+
+   if (key.type != BTRFS_EXTENT_DATA_KEY)
+   continue;
+
+   fi = btrfs_item_ptr(eb, i, struct btrfs_file_extent_item);
+   /* filter out non qgroup-accountable extents  */
+   extent_type = btrfs_file_extent_type(eb, fi);
+
+   if (extent_type == BTRFS_FILE_EXTENT_INLINE)
+   continue;
+
+   bytenr = btrfs_file_extent_disk_bytenr(eb, fi);
+   if (!bytenr)
+   continue;
+
+   num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
+
+   ret = btrfs_qgroup_record_ref(trans, root->fs_info,
+ root->objectid,
+ bytenr, num_bytes,
+ BTRFS_QGROUP_OPER_SUB_SUBTREE, 0);
+   if (ret)
+   return ret;
+   }
+   return 0;
+}
+
+/*
+ * Walk up the tree from the bottom, freeing leaves and any interior
+ * nodes which have had all slots visited. If a node (leaf or
+ * interior) is freed, the node above it will have it's slot
+ * incremented. The root node will never be freed.
+ *
+ * At the end of this function, we should have a path which has all
+ * slots incremented to the next position for a search. If we need to
+ * read a new node it will be NULL and the node above it will have the
+ * correct slot selected for a later read.
+ *
+ * If we increment the root nodes slot counter past the number of
+ * elements, 1 is returned to signal completion of the search.
+ */
+static int adjust_slots_upwards(struct btrfs_root *root,
+   struct btrfs_path *path, int root_level)
+{
+   int level = 0;
+   int nr, slot;
+   struct extent_buffer *eb;
+
+   if (root_level == 0)
+   return 1;
+
+   while (level <= root_level) {
+   eb = path->nodes[level];
+   nr = btrfs_header_nritems(eb);
+   path->slots[level]++;
+   slot = path->slots[level];
+   if (slot >= nr || level == 0) {
+   /*
+* Don't free the root -  we will detect this
+* condition after our loop and return a
+* positive value for caller to stop walking the tree.
+*/
+   if (level != root_level) {
+   btrfs_tree_unlock_rw(eb, path->locks[level]);
+   path->locks[level] = 0;
+
+   free_extent_buffer(eb);
+   path->nodes[level] = NULL;
+   path->slots[level] = 0;
+   }
+   } else {
+   /*
+* We have a valid slot to walk back down
+* from. Stop here so caller can process these
+* new nodes.
+*/
+   break;
+   }
+
+   level++;
+   }
+
+   eb = path->nodes[root_level];
+   if (path->slots[root_l

[PATCH 2/2] btrfs: syslog when quota is disabled

2014-07-07 Thread Anand Jain
Offline investigation of issues would need to know when quota was disabled.

Signed-off-by: Anand Jain 
---
 fs/btrfs/ioctl.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index bb4a498..fd29978 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4221,6 +4221,8 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
break;
case BTRFS_QUOTA_CTL_DISABLE:
ret = btrfs_quota_disable(trans, root->fs_info);
+   if (!ret)
+   btrfs_info(root->fs_info, "quota is disabled");
break;
default:
ret = -EINVAL;
-- 
2.0.0.257.g75cc6c6



[PATCH 1/2] btrfs: syslog when quota is enabled

2014-07-07 Thread Anand Jain
We must syslog when the btrfs working config changes, so as to support
offline investigation of issues.
---
 fs/btrfs/ioctl.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 016a5eb..bb4a498 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4216,6 +4216,8 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
switch (sa->cmd) {
case BTRFS_QUOTA_CTL_ENABLE:
ret = btrfs_quota_enable(trans, root->fs_info);
+   if (!ret)
+   btrfs_info(root->fs_info, "quota is enabled");
break;
case BTRFS_QUOTA_CTL_DISABLE:
ret = btrfs_quota_disable(trans, root->fs_info);
-- 
2.0.0.257.g75cc6c6



[PATCH RFC] btrfs: code optimize use btrfs_get_bdev_and_sb() at btrfs_scan_one_device

2014-07-07 Thread Anand Jain
(for review comments pls).

btrfs_scan_one_device() needs SB, instead of doing it from scratch could
use btrfs_get_bdev_and_sb()

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 51 ++-
 1 file changed, 6 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c166355..94e6131 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1053,14 +1053,11 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 {
struct btrfs_super_block *disk_super;
struct block_device *bdev;
-   struct page *page;
-   void *p;
int ret = -EINVAL;
u64 devid;
u64 transid;
u64 total_devices;
-   u64 bytenr;
-   pgoff_t index;
+   struct buffer_head *bh;
 
/*
 * we would like to check all the supers, but that would make
@@ -1068,44 +1065,12 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 * So, we need to add a special mount option to scan for
 * later supers, using BTRFS_SUPER_MIRROR_MAX instead
 */
-   bytenr = btrfs_sb_offset(0);
mutex_lock(&uuid_mutex);
 
-   bdev = blkdev_get_by_path(path, flags, holder);
-
-   if (IS_ERR(bdev)) {
-   ret = PTR_ERR(bdev);
+   ret = btrfs_get_bdev_and_sb(path, flags, holder, 0, &bdev, &bh);
+   if (ret)
goto error;
-   }
-
-   /* make sure our super fits in the device */
-   if (bytenr + PAGE_CACHE_SIZE >= i_size_read(bdev->bd_inode))
-   goto error_bdev_put;
-
-   /* make sure our super fits in the page */
-   if (sizeof(*disk_super) > PAGE_CACHE_SIZE)
-   goto error_bdev_put;
-
-   /* make sure our super doesn't straddle pages on disk */
-   index = bytenr >> PAGE_CACHE_SHIFT;
-   if ((bytenr + sizeof(*disk_super) - 1) >> PAGE_CACHE_SHIFT != index)
-   goto error_bdev_put;
-
-   /* pull in the page with our super */
-   page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
-  index, GFP_NOFS);
-
-   if (IS_ERR_OR_NULL(page))
-   goto error_bdev_put;
-
-   p = kmap(page);
-
-   /* align our pointer to the offset of the super block */
-   disk_super = p + (bytenr & ~PAGE_CACHE_MASK);
-
-   if (btrfs_super_bytenr(disk_super) != bytenr ||
-   btrfs_super_magic(disk_super) != BTRFS_MAGIC)
-   goto error_unmap;
+   disk_super = (struct btrfs_super_block *) bh->b_data;
 
devid = btrfs_stack_device_id(&disk_super->dev_item);
transid = btrfs_super_generation(disk_super);
@@ -1125,13 +1090,9 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
	printk(KERN_CONT "devid %llu transid %llu %s\n", devid, transid, path);
}
 
-
-error_unmap:
-   kunmap(page);
-   page_cache_release(page);
-
-error_bdev_put:
+   brelse(bh);
blkdev_put(bdev, flags);
+
 error:
mutex_unlock(&uuid_mutex);
return ret;
-- 
2.0.0.257.g75cc6c6



[PATCH] btrfs: test for valid bdev before kobj removal in btrfs_rm_device

2014-07-07 Thread Eric Sandeen
commit 4cd btrfs: dev delete should remove sysfs entry
added a btrfs_kobj_rm_device, which dereferences device->bdev...
right after we check whether device->bdev might be NULL.

I don't honestly know if it's possible to have a NULL device->bdev
here, but assuming that it is (given the test), we need to move
the kobject removal to be under that test.

(Coverity spotted this)

Signed-off-by: Eric Sandeen 
---

If it's not possible for bdev to be null, then the test should just
be removed, but that's above my current btrfs pay grade. ;)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6104676..6cb82f6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1680,11 +1680,11 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
if (device->bdev == root->fs_info->fs_devices->latest_bdev)
root->fs_info->fs_devices->latest_bdev = next_device->bdev;
 
-   if (device->bdev)
+   if (device->bdev) {
device->fs_devices->open_devices--;
-
-   /* remove sysfs entry */
-   btrfs_kobj_rm_device(root->fs_info, device);
+   /* remove sysfs entry */
+   btrfs_kobj_rm_device(root->fs_info, device);
+   }
 
call_rcu(&device->rcu, free_device);
 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] Revert "btrfs: allow mounting btrfs subvolumes with different ro/rw options"

2014-07-07 Thread Goffredo Baroncelli
On 07/07/2014 03:46 AM, Qu Wenruo wrote:
> 
[... cut ...]
>> 
>> So to me it seems reasonable to have different rw/ro status between
>> btrfs root and btrfs subvolume. As use case think a system which
>> hosts several guests in container. Each guest has its own subvolume
>> as root filesystem. An user would mount the btrfs root RO in order
>> to see all the subvolume but at the same time he avoids to change a
>> file for error; when a guest has to be started, its root
>> filesystem/subvolume can be mounted RW.
> You caught me. Yes, the use case seems quite reasonable, since
> currently you need to mount btrfs to get the subvolume list (the only
> offline method seems to be btrfs-debug-tree, but end-users won't use it
> anyway), and it's good admin behavior to mount it ro if there is no
> need to write.
>> 
>> On the other side, I understand that this could lead to an
>> unexpected behaviour because with the other filesystem it is
>> impossible to mount only a part as RW. In this BTRFS would be
>> different.
>> 
>> Following the "least surprise" principle, I prefer that the *mount*
>> RO/RW flag acts globally: the filesystem has only one status. It is
>> possible to change it only globally.
>> 
>> In order to having a subvolumes with different RO/RW status we
>> should rely on different flag. I have to point out that the
>> subvolume has already the concept of read-only status.
>> 
>> We could adopt the following rules:
>> - if the filesystem is mounted RO, then all the subvolumes (even
>>   the id=5) are RO
>> - if a subvolume is marked RO, then it is RO
>> - otherwise a subvolume is RW

> I'm confused by rule 1. When mentioning 'mounted RO', do you mean
> mounting subvolume id=5 RO? Also, you mentioned using a different
> RO/RW flag independent from the VFS RO/RW flags, so I am also
> confused: when mentioning RO, did you mean the VFS RO or the new
> btrfs RO/RW flags?


For "mounted RO" I mean the VFS flag, the "one" passed via the mount 
command. I say "one" as 1, because I am convinced that it has to act globally, 
e.g. on the whole filesystem; the flag should be set at the first mount,
then it can be changed (only ?) issuing a "mount -o remount,rw/ro"

For example, the following commands

  # mount -o subvol=subvolname,ro /dev/sda1 /mnt/btrfs-subvol
  # mount -o subvolid=5 /dev/sda1 /mnt/btrfs-root

cause the following ones

  # touch /mnt/btrfs-subvol/touch-a-file
  # touch /mnt/btrfs-root/touch-a-file2

to fail; and the following commands

  # mount -o subvol=subvolname,ro /dev/sda1 /mnt/btrfs-subvol
  # mount -o subvolid=5 /dev/sda1 /mnt/btrfs-root
  # mount -o remount,rw /mnt/btrfs-subvol

cause the following ones 

  # touch /mnt/btrfs-subvol/touch-a-file
  # touch /mnt/btrfs-root/touch-a-file2

to succeed

So for each filesystem there is a "global" ro/rw flag which acts on the
whole filesystem. Clear and simple.

Step 2: a more fine grained control of the subvolumes.
We have already the capability to make a subvolume read-only/read-write doing

   # btrfs property set -t s /path/to/subvolume ro true

or 

   # btrfs property set -t s /path/to/subvolume ro false

My idea is to use this flag. It could be done at the mount time for example:

  # mount -o subvolmode=ro,subvol=subvolname /dev/sda1 /

(this example doesn't work, it is only a my idea)

So:
- we should not add further code
- the semantic is simple
- the property is linked to the subvolume in a understandable way

We should only add the subvolmode=ro option to the mount command.

Further discussion needs to investigate the following cases:
- if the filesystem is mounted as ro (mount -o ro), should mounting a
  subvolume rw (mount -o subvolmode=rw...) raise an error?  (IMHO yes)
- if the filesystem is mounted as ro (mount -o ro), should mounting the
  filesystem a 2nd time rw (mount -o rw...) raise an error?  (IMHO yes)
- if a subvolume is mounted rw (or ro), should mounting the same subvolume
  a 2nd time as ro (or rw) raise an error?  (IMHO yes)


BR
G.Baroncelli

>> 
>> Moreover we can add further rules to inherit the subvolume RO/RW
>> status at creation time (even though it makes sense only for
>> snapshots). We could use an xattr for that.
>> 
>> Finally I would like to point out that relying on the relationship
>> parent/child between the subvolumes is very dangerous. With the
>> exception of subvolid=5, which is the only root one, it is very
>> easy to move subvolumes up and down. I have to point out this
>> because I read in another email that someone likes the idea of
>> having a RO subvolume because its parent is marked RO. But a
>> subvolume may be mounted also by id and not by its path (and/or
>> name). So relying on the relationship parent/child would lead to
>> breaking the "least surprise principle".
>> 
>> My 2 ¢ BR G.Baroncelli
> Oh, I forgot that users can mv subvolumes just like normal dirs. In
> this case it will certainly make an ro/rw disaster if we rely on the
> parent's ro/rw status. :(
> 
> Thanks, 

Re: mount time of multi-disk arrays

2014-07-07 Thread André-Sebastian Liebe
On 07/07/2014 04:14 PM, Austin S Hemmelgarn wrote:
> On 2014-07-07 09:54, Konstantinos Skarlatos wrote:
>> On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
>>> Hello List,
>>>
>>> can anyone tell me how much time is acceptable and assumable for a
>>> multi-disk btrfs array with classical hard disk drives to mount?
>>>
>>> I'm having a bit of trouble with my current systemd setup, because it
>>> couldn't mount my btrfs raid anymore after adding the 5th drive. With
>>> the 4 drive setup it failed to mount once in a few times. Now it fails
>>> everytime because the default timeout of 1m 30s is reached and mount is
>>> aborted.
>>> My last 10 manual mounts took between 1m57s and 2m12s to finish.
>> I have the exact same problem, and have to manually mount my large
>> multi-disk btrfs filesystems, so I would be interested in a solution as
>> well.
>>
>>> My hardware setup contains a
>>> - Intel Core i7 4770
>>> - Kernel 3.15.2-1-ARCH
>>> - 32GB RAM
>>> - dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
>>> - dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)
>>>
>>> Thanks in advance
>>>
>>> André-Sebastian Liebe
>>> --
>>>
>>>
>>> # btrfs fi sh
>>> Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
>>>  Total devices 5 FS bytes used 14.21TiB
>>>  devid1 size 3.64TiB used 2.86TiB path /dev/sdd
>>>  devid2 size 3.64TiB used 2.86TiB path /dev/sdc
>>>  devid3 size 3.64TiB used 2.86TiB path /dev/sdf
>>>  devid4 size 3.64TiB used 2.86TiB path /dev/sde
>>>  devid5 size 3.64TiB used 2.88TiB path /dev/sdb
>>>
>>> Btrfs v3.14.2-dirty
>>>
>>> # btrfs fi df /data/pool0/
>>> Data, single: total=14.28TiB, used=14.19TiB
>>> System, RAID1: total=8.00MiB, used=1.54MiB
>>> Metadata, RAID1: total=26.00GiB, used=20.20GiB
>>> unknown, single: total=512.00MiB, used=0.00
> This is interesting, I actually did some profiling of the mount timings
> for a bunch of different configurations of 4 (identical other than
> hardware age) 1TB Seagate disks.  One of the arrangements I tested was
> Data using single profile and Metadata/System using RAID1.  Based on the
> results I got, and what you are reporting, the mount time doesn't scale
> linearly in proportion to the amount of storage space.
>
> You might want to try the RAID10 profile for Metadata, of the
> configurations I tested, the fastest used Single for Data and RAID10 for
> Metadata/System.
Switching Metadata from raid1 to raid10 reduced mount times from roughly
120s to 38s!
>
> Also, based on the System chunk usage, I'm guessing that you have a LOT
> of subvolumes/snapshots, and I do know that having very large (100+)
> numbers of either does slow down the mount command (I don't think that
> we cache subvolume information between mount invocations, so it has to
> re-parse the system chunks for each individual mount).
No, I had to remove the one and only snapshot to recover from a 'no
space left on device' condition and regain metadata space
(http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html)

-- 
André-Sebastian Liebe

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mount time of multi-disk arrays

2014-07-07 Thread Benjamin O'Connor
As a point of reference, my BTRFS filesystem with 11 x 21TB devices in 
RAID0 with space cache enabled takes about 4 minutes to mount after a 
clean unmount.


There is a decent amount of variation in the time (it has been as low
as 3 minutes, or has taken 5 minutes or longer).  These devices are all 
connected via 10gb iscsi.


Mount time seems to have not increased relative to the number of devices 
(so far).  I think that back when we had only 6 devices, it still took 
roughly that amount of time.


-ben

--
-
Benjamin O'Connor
TechOps Systems Administrator
TripAdvisor Media Group

be...@tripadvisor.com
c. 617-312-9072
-



Re: mount time of multi-disk arrays

2014-07-07 Thread Duncan
Konstantinos Skarlatos posted on Mon, 07 Jul 2014 16:54:05 +0300 as
excerpted:

> On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
>>
>> can anyone tell me how much time is acceptable and assumable for a
>> multi-disk btrfs array with classical hard disk drives to mount?
>>
>> I'm having a bit of trouble with my current systemd setup, because it
>> couldn't mount my btrfs raid anymore after adding the 5th drive. With
>> the 4 drive setup it failed to mount once in a few times. Now it fails
>> everytime because the default timeout of 1m 30s is reached and mount is
>> aborted.
>> My last 10 manual mounts took between 1m57s and 2m12s to finish.

> I have the exact same problem, and have to manually mount my large
> multi-disk btrfs filesystems, so I would be interested in a solution as
> well.

I don't have a direct answer, as my btrfs devices are all SSD, but...

a) Btrfs, like some other filesystems, is designed not to need a
pre-mount (or pre-rw-mount) fsck, because it does what /should/ be a 
quick-scan at mount-time.  However, that isn't always as quick as it 
might be for a number of reasons:

a1) Btrfs is still a relatively immature filesystem and certain 
operations are not yet optimized.  In particular, multi-device btrfs 
operations tend to still be using a first-working-implementation type of 
algorithm instead of a well optimized for parallel operation algorithm, 
and thus often serialize access to multiple devices where a more 
optimized algorithm would parallelize operations across multiple devices 
at the same time.  That will come, but it's not there yet.

a2) Certain operations such as orphan cleanup ("orphans" are files that 
were deleted while they were in use and thus weren't fully deleted at the 
time; if they were still in use at unmount (remount-read-only), cleanup 
is done at mount-time) can delay mount as well.

a3) Inode_cache mount option:  Don't use this unless you can explain 
exactly WHY you are using it, preferably backed up with benchmark 
numbers, etc.  It's useful only on 32-bit, generally high-file-activity 
server systems and has general-case problems, including long mount times 
and possible overflow issues that make it inappropriate for normal use.  
Unfortunately there's a lot of people out there using it that shouldn't 
be, and I even saw it listed on at least one distro (not mine!) wiki. =:^(

a4) The space_cache mount option OTOH *IS* appropriate for normal use 
(and is in fact enabled by default these days), but particularly in 
improper shutdown cases can require rebuilding at mount time -- altho 
this should happen /after/ mount, the system will just be busy for some 
minutes, until the space-cache is rebuilt.  But the IO from a space_cache 
rebuild on one filesystem could slow down the mounting of filesystems 
that mount after it, as well as the boot-time launching of other post-
mount launched services.

If you're seeing the time go up dramatically with the addition of more 
filesystem devices, however, and you do /not/ have inode_cache active, 
I'd guess it's mainly the not-yet-optimized multi-device operations.


b) As with any systemd launched unit, however, there's systemd 
configuration mechanisms for working around specific unit issues, 
including timeout issues.  Of course most systems continue to use fstab 
and let systemd auto-generate the mount units, and in fact that is 
recommended, but either with fstab or directly created mount units, 
there's a timeout configuration option that can be set.

b1) The general systemd *.mount unit [Mount] section option appears to be 
TimeoutSec=.  As is usual with systemd times, the default is seconds, or 
pass the unit(s), like "5min 20s".

b2) I don't see it /specifically/ stated, but with a bit of reading 
between the lines, the corresponding fstab option appears to be either
x-systemd.timeoutsec= or x-systemd.TimeoutSec= (IOW I'm not sure of the 
case).  You may also want to try x-systemd.device-timeout=, which /is/ 
specifically mentioned, altho that appears to be specifically the timeout 
for the device to appear, NOT for the filesystem to mount after it does.

b3) See the systemd.mount (5) and systemd-fstab-generator (8) manpages 
for more, that being what the above is based on.

So it might take a bit of experimentation to find the exact command, but 
based on the above anyway, it /should/ be pretty easy to tell systemd to 
wait a bit longer for that filesystem.
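Something along these lines might serve as a starting point (an untested
sketch; x-systemd.device-timeout= is the documented option, though per the
caveat above it may cover only device appearance, not the mount itself --
UUID and mountpoint borrowed from the earlier posts in this thread):

# /etc/fstab
UUID=066141c6-16ca-4a30-b55c-e606b90ad0fb  /data/pool0  btrfs  defaults,x-systemd.device-timeout=5min  0  0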

When you find the right invocation, please reply with it here, as I'm 
sure there's others who will benefit as well.  FWIW, I'm still on 
reiserfs for my spinning rust (only btrfs on my ssds), but I expect I'll 
switch them to btrfs at some point, so I may well use the information 
myself.  =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] btrfs-progs: Add mount point check for 'btrfs fi df' command

2014-07-07 Thread Eric Sandeen
On 7/4/14, 8:52 AM, David Sterba wrote:
> On Fri, Jul 04, 2014 at 04:38:49PM +0800, Qu Wenruo wrote:
>> 'btrfs fi df' command is currently able to be executed on any file/dir
>> inside btrfs since it uses btrfs ioctl to get disk usage info.
>>
>> However it is somewhat confusing for some end users since normally such
>> command should only be executed on a mount point.
> 
> I disagree here, it's much more convenient to run 'fi df' anywhere and
> get the output. The system 'df' command works the same way.

I agree with that, and said as much in the original bug filed @Fedora.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2] btrfs compression: merge inflate and deflate z_streams

2014-07-07 Thread Sergey Senozhatsky
Hello,
This patch reduces zlib compression memory usage by `merging' inflate
and deflate streams into a single stream.

-- v2: rebased-on linux-next rc4 20140707

Sergey Senozhatsky (1):
  btrfs compression: merge inflate and deflate z_streams

 fs/btrfs/zlib.c | 138 
 1 file changed, 68 insertions(+), 70 deletions(-)

-- 
2.0.1.612.gea98109

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2] btrfs compression: merge inflate and deflate z_streams

2014-07-07 Thread Sergey Senozhatsky
`struct workspace' used for zlib compression contains two zlib
z_stream-s: `def_strm' used in zlib_compress_pages(), and `inf_strm'
used in zlib_decompress/zlib_decompress_biovec(). None of these
functions use `inf_strm' and `def_strm' simultaneously, meaning that
for every compress/decompress operation we need only one z_stream
(out of two available).

`inf_strm' and `def_strm' are different in size of ->workspace. For
inflate stream we vmalloc() zlib_inflate_workspacesize() bytes, for
deflate stream - zlib_deflate_workspacesize() bytes. On my system zlib
returns the following workspace sizes, correspondingly: 42312 and 268104
(+ guard pages).

Keep only one `z_stream' in `struct workspace' and use it for both
compression and decompression. Hence, instead of vmalloc() of two
z_stream->workspace-s, allocate only one of size:
max(zlib_deflate_workspacesize(), zlib_inflate_workspacesize())

Reviewed-by: David Sterba 
Signed-off-by: Sergey Senozhatsky 
---
 fs/btrfs/zlib.c | 138 
 1 file changed, 68 insertions(+), 70 deletions(-)

diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index b67d8fc..fa56a56 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -33,8 +33,7 @@
 #include "compression.h"
 
 struct workspace {
-   z_stream inf_strm;
-   z_stream def_strm;
+   z_stream strm;
char *buf;
struct list_head list;
 };
@@ -43,8 +42,7 @@ static void zlib_free_workspace(struct list_head *ws)
 {
struct workspace *workspace = list_entry(ws, struct workspace, list);
 
-   vfree(workspace->def_strm.workspace);
-   vfree(workspace->inf_strm.workspace);
+   vfree(workspace->strm.workspace);
kfree(workspace->buf);
kfree(workspace);
 }
@@ -52,17 +50,17 @@ static void zlib_free_workspace(struct list_head *ws)
 static struct list_head *zlib_alloc_workspace(void)
 {
struct workspace *workspace;
+   int workspacesize;
 
workspace = kzalloc(sizeof(*workspace), GFP_NOFS);
if (!workspace)
return ERR_PTR(-ENOMEM);
 
-   workspace->def_strm.workspace = vmalloc(zlib_deflate_workspacesize(
-   MAX_WBITS, MAX_MEM_LEVEL));
-   workspace->inf_strm.workspace = vmalloc(zlib_inflate_workspacesize());
+   workspacesize = max(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL),
+   zlib_inflate_workspacesize());
+   workspace->strm.workspace = vmalloc(workspacesize);
workspace->buf = kmalloc(PAGE_CACHE_SIZE, GFP_NOFS);
-   if (!workspace->def_strm.workspace ||
-   !workspace->inf_strm.workspace || !workspace->buf)
+   if (!workspace->strm.workspace || !workspace->buf)
goto fail;
 
INIT_LIST_HEAD(&workspace->list);
@@ -96,14 +94,14 @@ static int zlib_compress_pages(struct list_head *ws,
*total_out = 0;
*total_in = 0;
 
-   if (Z_OK != zlib_deflateInit(&workspace->def_strm, 3)) {
+   if (Z_OK != zlib_deflateInit(&workspace->strm, 3)) {
printk(KERN_WARNING "BTRFS: deflateInit failed\n");
ret = -EIO;
goto out;
}
 
-   workspace->def_strm.total_in = 0;
-   workspace->def_strm.total_out = 0;
+   workspace->strm.total_in = 0;
+   workspace->strm.total_out = 0;
 
in_page = find_get_page(mapping, start >> PAGE_CACHE_SHIFT);
data_in = kmap(in_page);
@@ -117,25 +115,25 @@ static int zlib_compress_pages(struct list_head *ws,
pages[0] = out_page;
nr_pages = 1;
 
-   workspace->def_strm.next_in = data_in;
-   workspace->def_strm.next_out = cpage_out;
-   workspace->def_strm.avail_out = PAGE_CACHE_SIZE;
-   workspace->def_strm.avail_in = min(len, PAGE_CACHE_SIZE);
+   workspace->strm.next_in = data_in;
+   workspace->strm.next_out = cpage_out;
+   workspace->strm.avail_out = PAGE_CACHE_SIZE;
+   workspace->strm.avail_in = min(len, PAGE_CACHE_SIZE);
 
-   while (workspace->def_strm.total_in < len) {
-   ret = zlib_deflate(&workspace->def_strm, Z_SYNC_FLUSH);
+   while (workspace->strm.total_in < len) {
+   ret = zlib_deflate(&workspace->strm, Z_SYNC_FLUSH);
if (ret != Z_OK) {
printk(KERN_DEBUG "BTRFS: deflate in loop returned %d\n",
   ret);
-   zlib_deflateEnd(&workspace->def_strm);
+   zlib_deflateEnd(&workspace->strm);
ret = -EIO;
goto out;
}
 
/* we're making it bigger, give up */
-   if (workspace->def_strm.total_in > 8192 &&
-   workspace->def_strm.total_in <
-   workspace->def_strm.total_out) {
+   if (workspace->strm.total_in > 8192 &&
+   workspace->strm.total_in <
+   workspace->strm.total_out) {

[PATCH] Btrfs-progs: fix Segmentation fault of btrfs-convert

2014-07-07 Thread Liu Bo
Recently we merged a memory leak fix, which fails xfstests/btrfs/012.
The cause is that it only frees @fs_devices but leaves it on the global
fs_uuid list, which causes a 'Segmentation fault' when running the
btrfs-convert command.  This fixes the problem.

Signed-off-by: Liu Bo 
---
 volumes.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/volumes.c b/volumes.c
index a61928c..8b827fa 100644
--- a/volumes.c
+++ b/volumes.c
@@ -184,11 +184,17 @@ again:
seed_devices = fs_devices->seed;
fs_devices->seed = NULL;
if (seed_devices) {
+   struct btrfs_fs_devices *orig;
+
+   orig = fs_devices;
fs_devices = seed_devices;
+   list_del(&orig->list);
+   free(orig);
goto again;
+   } else {
+   list_del(&fs_devices->list);
+   free(fs_devices);
}
-
-   free(fs_devices);
return 0;
 }
 
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs compression: merge inflate and deflate z_streams

2014-07-07 Thread Sergey Senozhatsky
On (07/01/14 16:44), David Sterba wrote:
> On Tue, Jul 01, 2014 at 12:32:10AM +0900, Sergey Senozhatsky wrote:
> > `struct workspace' used for zlib compression contains two zlib
> > z_stream-s: `def_strm' used in zlib_compress_pages(), and `inf_strm'
> > used in zlib_decompress/zlib_decompress_biovec(). None of these
> > functions use `inf_strm' and `def_strm' simultaneously, meaning that
> > for every compress/decompress operation we need only one z_stream
> > (out of two available).
> > 
> > `inf_strm' and `def_strm' are different in size of ->workspace. For
> > inflate stream we vmalloc() zlib_inflate_workspacesize() bytes, for
> > deflate stream - zlib_deflate_workspacesize() bytes. On my system zlib
> > returns the following workspace sizes, correspondingly: 42312 and 268104
> > (+ guard pages).
> > 
> > Keep only one `z_stream' in `struct workspace' and use it for both
> > compression and decompression. Hence, instead of vmalloc() of two
> > z_stream->workspace-s, allocate only one of size:
> > max(zlib_deflate_workspacesize(), zlib_inflate_workspacesize())
> > 
> > Signed-off-by: Sergey Senozhatsky 
> 
> Reviewed-by: David Sterba 
> 

Hello,

the patch does not apply against linux-next rc4-20140707 due to 130d5b415a091e.
The unhappy hunk is:

+   if (workspace->strm.total_in > 8192 &&
+   workspace->strm.total_in <
+   workspace->strm.total_out) {
ret = -EIO;

now it should be:
+   if (workspace->strm.total_in > 8192 &&
+   workspace->strm.total_in <
+   workspace->strm.total_out) {
ret = -E2BIG;


I'll rebase and resend.

-ss
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mount time of multi-disk arrays

2014-07-07 Thread André-Sebastian Liebe
On 07/07/2014 03:54 PM, Konstantinos Skarlatos wrote:
> On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
>> Hello List,
>>
>> can anyone tell me how much time is acceptable and assumable for a
>> multi-disk btrfs array with classical hard disk drives to mount?
>>
>> I'm having a bit of trouble with my current systemd setup, because it
>> couldn't mount my btrfs raid anymore after adding the 5th drive. With
>> the 4 drive setup it failed to mount once in a few times. Now it fails
>> everytime because the default timeout of 1m 30s is reached and mount is
>> aborted.
>> My last 10 manual mounts took between 1m57s and 2m12s to finish.
> I have the exact same problem, and have to manually mount my large
> multi-disk btrfs filesystems, so I would be interested in a solution
> as well.
Hi Konstantinos , you can workaround this by manual creating a systemd
mount unit.

- First review the autogenerated systemd mount unit (systemctl show
<unit-name>.mount). You can get the unit name by issuing a plain
'systemctl' and looking for your failed mount.
- Then you have to take the needed values (After, Before, Conflicts,
RequiresMountsFor, Where, What, Options, Type, WantedBy) and put them
into a new systemd mount unit file (possibly under
/usr/lib/systemd/system/<unit-name>.mount ).
- Now just add the TimeoutSec with a large enough value below [Mount].
- If you later want to automount your raid, add the WantedBy under [Install].
- Now issue a 'systemctl daemon-reload' and look for error messages in
syslog.
- If there are no errors you can enable your manual mount entry with
'systemctl enable <unit-name>.mount' and safely comment out your
old fstab entry (so systemd no longer generates the unit automatically).

-- 8< --- 8< --- 8< --- 8< --- 8<
--- 8< --- 8< ---
[Unit]
Description=Mount /data/pool0
After=dev-disk-by\x2duuid-066141c6\x2d16ca\x2d4a30\x2db55c\x2de606b90ad0fb.device
systemd-journald.socket local-fs-pre.target system.slice -.mount
Before=umount.target
Conflicts=umount.target
RequiresMountsFor=/data
/dev/disk/by-uuid/066141c6-16ca-4a30-b55c-e606b90ad0fb

[Mount]
Where=/data/pool0
What=/dev/disk/by-uuid/066141c6-16ca-4a30-b55c-e606b90ad0fb
Options=rw,relatime,skip_balance,compress
Type=btrfs
TimeoutSec=3min

[Install]
WantedBy=dev-disk-by\x2duuid-066141c6\x2d16ca\x2d4a30\x2db55c\x2de606b90ad0fb.device
-- 8< --- 8< --- 8< --- 8< --- 8<
--- 8< --- 8< ---
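For example, assuming the unit file above is saved as data-pool0.mount
(the name systemd derives from Where=/data/pool0 by its path escaping
rules), the daemon-reload/enable steps above would look like:

# systemctl daemon-reload
# systemctl enable data-pool0.mount
# systemctl start data-pool0.mount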


>
>>
>> My hardware setup contains a
>> - Intel Core i7 4770
>> - Kernel 3.15.2-1-ARCH
>> - 32GB RAM
>> - dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
>> - dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)
>>
>> Thanks in advance
>>
>> André-Sebastian Liebe
>> --
>>
>>
>> # btrfs fi sh
>> Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
>>  Total devices 5 FS bytes used 14.21TiB
>>  devid1 size 3.64TiB used 2.86TiB path /dev/sdd
>>  devid2 size 3.64TiB used 2.86TiB path /dev/sdc
>>  devid3 size 3.64TiB used 2.86TiB path /dev/sdf
>>  devid4 size 3.64TiB used 2.86TiB path /dev/sde
>>  devid5 size 3.64TiB used 2.88TiB path /dev/sdb
>>
>> Btrfs v3.14.2-dirty
>>
>> # btrfs fi df /data/pool0/
>> Data, single: total=14.28TiB, used=14.19TiB
>> System, RAID1: total=8.00MiB, used=1.54MiB
>> Metadata, RAID1: total=26.00GiB, used=20.20GiB
>> unknown, single: total=512.00MiB, used=0.00
>>
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> -- 
> Konstantinos Skarlatos
--
André-Sebastian Liebe

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mount time of multi-disk arrays

2014-07-07 Thread Austin S Hemmelgarn
On 2014-07-07 09:54, Konstantinos Skarlatos wrote:
> On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
>> Hello List,
>>
>> can anyone tell me how much time is acceptable and assumable for a
>> multi-disk btrfs array with classical hard disk drives to mount?
>>
>> I'm having a bit of trouble with my current systemd setup, because it
>> couldn't mount my btrfs raid anymore after adding the 5th drive. With
>> the 4 drive setup it failed to mount once in a few times. Now it fails
>> everytime because the default timeout of 1m 30s is reached and mount is
>> aborted.
>> My last 10 manual mounts took between 1m57s and 2m12s to finish.
> I have the exact same problem, and have to manually mount my large
> multi-disk btrfs filesystems, so I would be interested in a solution as
> well.
> 
>>
>> My hardware setup contains a
>> - Intel Core i7 4770
>> - Kernel 3.15.2-1-ARCH
>> - 32GB RAM
>> - dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
>> - dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)
>>
>> Thanks in advance
>>
>> André-Sebastian Liebe
>> --
>>
>>
>> # btrfs fi sh
>> Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
>>  Total devices 5 FS bytes used 14.21TiB
>>  devid1 size 3.64TiB used 2.86TiB path /dev/sdd
>>  devid2 size 3.64TiB used 2.86TiB path /dev/sdc
>>  devid3 size 3.64TiB used 2.86TiB path /dev/sdf
>>  devid4 size 3.64TiB used 2.86TiB path /dev/sde
>>  devid5 size 3.64TiB used 2.88TiB path /dev/sdb
>>
>> Btrfs v3.14.2-dirty
>>
>> # btrfs fi df /data/pool0/
>> Data, single: total=14.28TiB, used=14.19TiB
>> System, RAID1: total=8.00MiB, used=1.54MiB
>> Metadata, RAID1: total=26.00GiB, used=20.20GiB
>> unknown, single: total=512.00MiB, used=0.00

This is interesting, I actually did some profiling of the mount timings
for a bunch of different configurations of 4 (identical other than
hardware age) 1TB Seagate disks.  One of the arrangements I tested was
Data using single profile and Metadata/System using RAID1.  Based on the
results I got, and what you are reporting, the mount time doesn't scale
linearly in proportion to the amount of storage space.

You might want to try the RAID10 profile for Metadata, of the
configurations I tested, the fastest used Single for Data and RAID10 for
Metadata/System.

Also, based on the System chunk usage, I'm guessing that you have a LOT
of subvolumes/snapshots, and I do know that having very large (100+)
numbers of either does slow down the mount command (I don't think that
we cache subvolume information between mount invocations, so it has to
re-parse the system chunks for each individual mount).
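(For reference, a quick way to count subvolumes and snapshots on a
mounted filesystem -- the path here is just an example from this thread:

# btrfs subvolume list /data/pool0 | wc -l

snapshots are listed as subvolumes, so they are included in the count.)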





Re: mount time of multi-disk arrays

2014-07-07 Thread Konstantinos Skarlatos

On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:

Hello List,

can anyone tell me how much time is acceptable and assumable for a
multi-disk btrfs array with classical hard disk drives to mount?

I'm having a bit of trouble with my current systemd setup, because it
couldn't mount my btrfs raid anymore after adding the 5th drive. With
the 4 drive setup it failed to mount once in a few times. Now it fails
everytime because the default timeout of 1m 30s is reached and mount is
aborted.
My last 10 manual mounts took between 1m57s and 2m12s to finish.
I have the exact same problem, and have to manually mount my large 
multi-disk btrfs filesystems, so I would be interested in a solution as 
well.




My hardware setup contains a
- Intel Core i7 4770
- Kernel 3.15.2-1-ARCH
- 32GB RAM
- dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
- dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)

Thanks in advance

André-Sebastian Liebe
--

# btrfs fi sh
Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
 Total devices 5 FS bytes used 14.21TiB
 devid1 size 3.64TiB used 2.86TiB path /dev/sdd
 devid2 size 3.64TiB used 2.86TiB path /dev/sdc
 devid3 size 3.64TiB used 2.86TiB path /dev/sdf
 devid4 size 3.64TiB used 2.86TiB path /dev/sde
 devid5 size 3.64TiB used 2.88TiB path /dev/sdb

Btrfs v3.14.2-dirty

# btrfs fi df /data/pool0/
Data, single: total=14.28TiB, used=14.19TiB
System, RAID1: total=8.00MiB, used=1.54MiB
Metadata, RAID1: total=26.00GiB, used=20.20GiB
unknown, single: total=512.00MiB, used=0.00


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
Konstantinos Skarlatos

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


mount time of multi-disk arrays

2014-07-07 Thread André-Sebastian Liebe
Hello List,

can anyone tell me how much time is acceptable and assumable for a
multi-disk btrfs array with classical hard disk drives to mount?

I'm having a bit of trouble with my current systemd setup, because it
couldn't mount my btrfs raid anymore after adding the 5th drive. With
the 4 drive setup it failed to mount once in a few times. Now it fails
everytime because the default timeout of 1m 30s is reached and mount is
aborted.
My last 10 manual mounts took between 1m57s and 2m12s to finish.

My hardware setup contains a
- Intel Core i7 4770
- Kernel 3.15.2-1-ARCH
- 32GB RAM
- dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
- dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)

Thanks in advance

André-Sebastian Liebe
--

# btrfs fi sh
Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
Total devices 5 FS bytes used 14.21TiB
devid1 size 3.64TiB used 2.86TiB path /dev/sdd
devid2 size 3.64TiB used 2.86TiB path /dev/sdc
devid3 size 3.64TiB used 2.86TiB path /dev/sdf
devid4 size 3.64TiB used 2.86TiB path /dev/sde
devid5 size 3.64TiB used 2.88TiB path /dev/sdb

Btrfs v3.14.2-dirty

# btrfs fi df /data/pool0/
Data, single: total=14.28TiB, used=14.19TiB
System, RAID1: total=8.00MiB, used=1.54MiB
Metadata, RAID1: total=26.00GiB, used=20.20GiB
unknown, single: total=512.00MiB, used=0.00


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qgroup destroy / assign

2014-07-07 Thread Kevin Brandstatter
Wang,

Yes, that certainly helps me make more sense of it. I was able to get
qgroup assignment to work properly.
I guess the next question would be whether it would be a valid feature
to implement automatic qgroup deletion when a subvolume is destroyed.
I suppose, in order to help alleviate issues with that, it may also be
useful to require user-created qgroups to be at least level 1. That way
it would be trivial to detect qgroups that were created for subvolumes,
as they would all be level 0. I don't think this would cause any issues,
since from what I can tell you can't assign a subvolume to another
qgroup, only a qgroup to another qgroup.
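A minimal standalone sketch of that level/id encoding (the qgroupid()
helper here is made up for illustration; only the ">> 48" comparison is
the actual test quoted below from cmds-qgroup.c):

#include <stdio.h>
#include <stdint.h>

/* A qgroupid is a u64: the high 16 bits hold the level, the low 48
 * bits hold the id.  "1/1" means level 1, id 1; a subvolume's
 * auto-created qgroup "0/256" means level 0, id 256. */
static uint64_t qgroupid(uint16_t level, uint64_t id)
{
	return ((uint64_t)level << 48) | (id & ((1ULL << 48) - 1));
}

int main(void)
{
	uint64_t src = qgroupid(0, 256); /* child: subvolume qgroup 0/256 */
	uint64_t dst = qgroupid(1, 1);   /* parent: user qgroup 1/1 */

	/* The parent (dst) must sit at a strictly higher level than
	 * the child (src), which is what the quoted check enforces. */
	if ((src >> 48) >= (dst >> 48))
		printf("ERROR: bad relation requested\n");
	else
		printf("ok: 0/256 can be assigned under 1/1\n");
	return 0;
}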

-Kevin

On 07/06/2014 08:57 PM, Wang Shilong wrote:
> Hi Kevin,
>
> On 07/05/2014 05:10 AM, Kevin Brandstatter wrote:
>> how are qgroups accounted for? Are they specifially tied to one
>> subvolume on creation?
> The qgroup implementation was also a little confusing for me at first. :-)
>
> Yes, a qgroup is created automatically tied to one subvolume on creation
> with the same objectid.
>
> To implement a qgroup hierarchy, you may want to do something like the
> following:
>
>            [1/1]
>           /     \
>          /       \
>     sub1(5)    subv2(257)
>
>>
>> If so, is it possible to auto delete relavant qgroups on deletion of the
>> subvolume?
> I suppose so. According to the latest qgroup patches in flight, a
> subvolume qgroup should be destroyed safely once it has finished
> sub-tree space accounting.
>
>>
>> also, how exactly does qgroup assign work? I havent been able to get it
>> to work at all.
>> in btrfsprogs cmds-qgroup.c
>> if ((args.src >> 48) >= (args.dst >> 48)) {
>>fprintf(stderr, "ERROR: bad relation requested '%s'\n", path);
>>return 1;
>> }
> Oh, this is to implement a strict qgroup level hierarchy, which means a
> u64 is divided into two parts: 16 bits for the level and the rest for
> the id.
>
> So we require that the parent qgroup's level must be greater than the
> child qgroup's; that is the code you see above.
> You could create a qgroup relation like this:
>
> # btrfs qgroup assign 256 1/1 <mnt>
>
> Hopefully this helps you.
>> always seems to fail. I tried creating another qgroup id 1000, and
>> assigning it to as sub, and vice versa, as well as assigning the sub to
>> the root, and vice versa, as well as one subvol to another.
>> The fixme comment leads me to believe that the src should be a path not
>> a qgroup ("FIXME src should accept subvol path")
>> but the progs let me create a qgroup without a subvol, which makes sense
>> if you want to be able to have some meta-qgroup for a bunch of subvols.
>> Further on noticing that a sub create also creates a qgroup with the
>> same id as the subvol, it would seem that the qgroup is tied to the
>> subvol via this shared id.
>>
>> -Kevin Brandstatter
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: set error return value in btrfs_get_blocks_direct

2014-07-07 Thread Filipe Manana
We were returning with 0 (success) because we weren't extracting the
error code from em (PTR_ERR(em)). Fix it.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6b65fab..8a946c0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6998,8 +6998,10 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
   block_start, len,
   orig_block_len,
   ram_bytes, type);
-   if (IS_ERR(em))
+   if (IS_ERR(em)) {
+   ret = PTR_ERR(em);
goto unlock_err;
+   }
}
 
ret = btrfs_add_ordered_extent_dio(inode, start,
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 7/9] btrfs: fix null pointer dereference in clone_fs_devices when name is null

2014-07-07 Thread Anand Jain



On 07/07/2014 12:22, Miao Xie wrote:

On Mon, 7 Jul 2014 12:04:09 +0800, Anand Jain wrote:

when one of the device path is missing btrfs_device name is null. So this
patch will check for that.

stack:
BUG: unable to handle kernel NULL pointer dereference at 0010
IP: [] strlen+0x0/0x30
[] ? clone_fs_devices+0xaa/0x160 [btrfs]
[] btrfs_init_new_device+0x317/0xca0 [btrfs]
[] ? __kmalloc_track_caller+0x15a/0x1a0
[] btrfs_ioctl+0xaa3/0x2860 [btrfs]
[] ? handle_mm_fault+0x48c/0x9c0
[] ? __blkdev_put+0x171/0x180
[] ? __do_page_fault+0x4ac/0x590
[] ? blkdev_put+0x106/0x110
[] ? mntput+0x35/0x40
[] do_vfs_ioctl+0x460/0x4a0
[] ? fput+0xe/0x10
[] ? task_work_run+0xb3/0xd0
[] SyS_ioctl+0x57/0x90
[] ? do_page_fault+0xe/0x10
[] system_call_fastpath+0x16/0x1b

reproducer:
mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
btrfstune -S 1 /dev/sdg1
modprobe -r btrfs && modprobe btrfs
mount -o degraded /dev/sdg1 /btrfs
btrfs dev add /dev/sdg3 /btrfs

Signed-off-by: Anand Jain 
Signed-off-by: Miao Xie 
---
Changelog v1->v2:
- Fix the problem that we forgot to set the missing flag for the cloned device
---
   fs/btrfs/volumes.c | 25 -
   1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1891541..4731bd6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -598,16 +598,23 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
   if (IS_ERR(device))
   goto error;

-/*
- * This is ok to do without rcu read locked because we hold the
- * uuid mutex so nothing we touch in here is going to disappear.
- */
-name = rcu_string_strdup(orig_dev->name->str, GFP_NOFS);
-if (!name) {
-kfree(device);
-goto error;
+if (orig_dev->missing) {
+device->missing = 1;
+fs_devices->missing_devices++;


  As mentioned, in some places we just check the name (for a missing
  device) and don't set the missing flag, so it is better to do:

  if (orig_dev->missing || !orig_dev->name) {
 device->missing = 1;
 fs_devices->missing_devices++;


I don't think we need to check the name pointer here, because only a
missing device doesn't have its own name. Otherwise there is something
wrong in the code, which is why I added the assert in the else branch.
Am I right?


In a few critical code paths, like the one below (and I guess in the
chunk/stripe functions as well), we don't make use of the missing flag,
but rather ->name.

-
btrfsic_process_superblock
::
if (!device->bdev || !device->name)
continue;
-

 But here, even without the !orig_dev->name check, it is good enough.

Thanks, Anand



+} else {
+ASSERT(orig_dev->name);
+/*
+ * This is ok to do without rcu read locked because
+ * we hold the uuid mutex so nothing we touch in here
+ * is going to disappear.
+ */
+name = rcu_string_strdup(orig_dev->name->str, GFP_NOFS);
+if (!name) {
+kfree(device);
+goto error;
+}
+rcu_assign_pointer(device->name, name);
   }
-rcu_assign_pointer(device->name, name);

   list_add(&device->dev_list, &fs_devices->devices);
   device->fs_devices = fs_devices;



Thanks, Anand
.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs-progs: Add mount point check for 'btrfs fi df' command

2014-07-07 Thread Vikram Goyal

On Fri, Jul 04, 2014 at 03:52:26PM +0200, David Sterba wrote:

On Fri, Jul 04, 2014 at 04:38:49PM +0800, Qu Wenruo wrote:

'btrfs fi df' command is currently able to be executed on any file/dir
inside btrfs since it uses btrfs ioctl to get disk usage info.

However it is somewhat confusing for some end users since normally such
command should only be executed on a mount point.


I disagree here, it's much more convenient to run 'fi df' anywhere and
get the output. The system 'df' command works the same way.


Just to clarify, in case my earlier mail did not convey the idea
properly.

The basic difference between traditional df & btrfs fi df is that
traditional df does not error out when no arg is given & outputs all the
mounted FSes with their mount points. So, to be consistent, btrfs fi df
should output all btrfs filesystems with their mount points if no arg is
given.

Btrfs fi df insists on an arg but does not clarify in its output whether
the given arg is a path inside a mount point or the mount point itself;
this would become transparent if the mount point were also shown in the
output.

This is just a request & a pointer to an oversight/anomaly, but if the
developers do not feel in resonance with it right now then I just wish
that they keep it in mind, think about it & remove this confusion caused
by btrfs fi df as, when & how they see fit.



The 'fi df' command itself is not that user friendly and the numbers
need further interpretation. I'm using it heavily during debugging and
restricting it to the mountpoint seems too artifical, the tool can cope
with that.

The 'fi usage' is supposed to give the user-friendly overview, but the
patchset is stuck because I found the numbers wrong or misleading under
some circumstances.

I'll reread the thread that motivated this patch to see if there's
something to address.


Thanks

--
vikram...


^^'^^||root||^^^'''^^
   // \\   ))
  //(( \\// \\
 // /\\ ||   \\
|| / )) ((\\
--
Rule of Life #1 -- Never get separated from your luggage.
--
 _
~|~
 =
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: fix null pointer dereference in clone_fs_devices when name is null

2014-07-07 Thread Anand Jain




It's a pity that the patch has been merged into the upstream kernel.
Let's correct our miss before the next merge.


 What I found were new-bugs, those are not related to this patch.


BTW, I sent some patches to fix the problems about seed device(including
the updated patch of this one), could you try them and confirm that they
can fix the problems you said above or not?

[PATCH V2 7/9] btrfs: fix null pointer dereference in clone_fs_devices when 
name is null
[PATCH 8/9] Btrfs: fix unzeroed members in fs_devices when creating a fs from 
seed fs
[PATCH 9/9] Btrfs: fix writing data into the seed filesystem

This first one is the updated patch of this one.


 With 8,9/9 it fixes the new-bugs as well. Thanks.

Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: btrfs loopback problems

2014-07-07 Thread Chris Samuel
On Mon, 7 Jul 2014 11:20:30 AM Qu Wenruo wrote:

> As Chris Mason mentioned, fixed in the following patch:
> https://patchwork.kernel.org/patch/4143821/

That should probably go to -stable (if it hasn't already), especially as 3.14 
is a new LTS kernel.

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC


