Re: [PATCH] Btrfs: remove transaction from send

2014-03-15 Thread Hugo Mills
On Fri, Mar 14, 2014 at 10:44:04PM +, Hugo Mills wrote:
 On Fri, Mar 14, 2014 at 02:51:22PM -0400, Josef Bacik wrote:
  On 03/13/2014 06:16 PM, Hugo Mills wrote:
  On Thu, Mar 13, 2014 at 03:42:13PM -0400, Josef Bacik wrote:
  Let's try this again.  We can deadlock the box if we send on a box and
  try to write onto the same fs with the app that is trying to listen to
  the send pipe.  This is because the writer could get stuck waiting for
  a transaction commit which is being blocked by the send.  So fix this
  by making sure looking at the commit roots is always going to be
  consistent.  We do this by keeping track of which roots need to have
  their commit roots swapped during commit, and then taking the
  commit_root_sem and swapping them all at once.  Then make sure we take
  a read lock on the commit_root_sem in cases where we search the commit
  root to make sure we're always looking at a consistent view of the
  commit roots.  Previously we had problems with this because we would
  swap a fs tree commit root and then swap the extent tree commit root
  independently, which would cause the backref walking code to screw up
  sometimes.  With this patch we no longer deadlock and pass all the
  weird send/receive corner cases.  Thanks,
  
  There's something still going on here. I managed to get about twice
  as far through my test as I had before, but I again got an unexpected
  EOF in stream, with btrfs send returning 1. As before, I have this in
  syslog:
  
  Mar 13 22:09:12 s_src@amelia kernel: BTRFS error (device sda2): did not
  find backref in send_root. inode=1786631, offset=825257984,
  disk_byte=36504023040 found extent=36504023040
  
  
  I just noticed that the offset you have there is freaking gigantic,
  like 700mb, which is way larger than what an extent should be.  Here
  is a newer debug patch; just chuck the old one and put this in instead
  and re-run:
  
  http://paste.fedoraproject.org/85486/39482301
 
    That last run, with the above patch, failed again, at approximately
 the same place. The only output in dmesg is:
 
 [ 6488.168469] BTRFS error (device sda2): did not find backref in
 send_root. inode=1786631, offset=825257984, disk_byte=36504023040 found
 extent=36504023040, len=1294336

root@amelia:~# btrfs insp ino 1786631 /
//srv/vm/armand.img
root@amelia:~# ls -l /srv/vm/armand.img 
-rw-rw-r-- 1 root kvm 40 Jan 30 08:11 /srv/vm/armand.img
root@amelia:~# filefrag /srv/vm/armand.img
/srv/vm/armand.img: 17436 extents found

   This is a VM image, not currently operational. It probably has
sparse extents in it somewhere.

   The full filefrag -ev output is at [1], but the offset it's
complaining about is 825257984 = 201479 4k blocks:

 ext: logical_offset:physical_offset: length:   expected: flags:
17200:   201478..  201478:7220724..   7220724:  1:8923002:
17201:   201479..  201481:8912386..   8912388:  3:7220725:
17202:   201482..  201482:8923002..   8923002:  1:8912389:

   This seems unexceptional.

   Hugo.

[1] http://carfax.org.uk/files/temp/filefrag.txt

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Can I offer you anything? Tea? Seedcake? ---
 Glass of Amontillado?  




Re: [PATCH] Btrfs: take into account total references when doing backref lookup V2

2014-03-19 Thread Hugo Mills
On Wed, Mar 19, 2014 at 01:35:14PM -0400, Josef Bacik wrote:
 I added an optimization for large files where we would stop searching
 for backrefs once we had looked at the number of references we
 currently had for this extent.  This works great most of the time, but
 for snapshots that point to this extent and have changes in the
 original root, this assumption falls on its face.  So keep track of any
 delayed ref mods made and add in the actual ref count as reported by
 the extent item, and use that to limit how far down an inode we'll
 search for extents.  Thanks,
 
 Reported-by: Hugo Mills h...@carfax.org.uk

Reported-by: Hugo Mills h...@carfax.org.uk

 Signed-off-by: Josef Bacik jba...@fb.com

Tested-by: Hugo Mills h...@carfax.org.uk

   Looks like it's worked. (Modulo the above typo in the metadata ;) )
I'll do a more complete test overnight.

   Hugo.

 ---
 V1->V2: Just use the extent ref count and any delayed ref counts, this
 will work out right, whereas the shared thing doesn't work out in some
 cases.
 
  fs/btrfs/backref.c | 29 ++---
  1 file changed, 18 insertions(+), 11 deletions(-)
 
 diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
 index 0be0e94..10db21f 100644
 --- a/fs/btrfs/backref.c
 +++ b/fs/btrfs/backref.c
 @@ -220,7 +220,8 @@ static int __add_prelim_ref(struct list_head *head, u64 root_id,
  
  static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
  			   struct ulist *parents, struct __prelim_ref *ref,
 -			   int level, u64 time_seq, const u64 *extent_item_pos)
 +			   int level, u64 time_seq, const u64 *extent_item_pos,
 +			   u64 total_refs)
  {
  	int ret = 0;
  	int slot;
 @@ -249,7 +250,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
  	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0]))
  		ret = btrfs_next_old_leaf(root, path, time_seq);
  
 -	while (!ret && count < ref->count) {
 +	while (!ret && count < total_refs) {
  		eb = path->nodes[0];
  		slot = path->slots[0];
  
 @@ -306,7 +307,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
  				  struct btrfs_path *path, u64 time_seq,
  				  struct __prelim_ref *ref,
  				  struct ulist *parents,
 -				  const u64 *extent_item_pos)
 +				  const u64 *extent_item_pos, u64 total_refs)
  {
  	struct btrfs_root *root;
  	struct btrfs_key root_key;
 @@ -364,7 +365,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
  	}
  
  	ret = add_all_parents(root, path, parents, ref, level, time_seq,
 -			      extent_item_pos);
 +			      extent_item_pos, total_refs);
  out:
  	path->lowest_level = 0;
  	btrfs_release_path(path);
 @@ -377,7 +378,7 @@ out:
  static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
  				   struct btrfs_path *path, u64 time_seq,
  				   struct list_head *head,
 -				   const u64 *extent_item_pos)
 +				   const u64 *extent_item_pos, u64 total_refs)
  {
  	int err;
  	int ret = 0;
 @@ -403,7 +404,8 @@ static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
  		if (ref->count == 0)
  			continue;
  		err = __resolve_indirect_ref(fs_info, path, time_seq, ref,
 -					     parents, extent_item_pos);
 +					     parents, extent_item_pos,
 +					     total_refs);
  		/*
  		 * we can only tolerate ENOENT,otherwise,we should catch error
  		 * and return directly.
 @@ -560,7 +562,7 @@ static void __merge_refs(struct list_head *head, int mode)
   * smaller or equal that seq to the list
   */
  static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
 -			      struct list_head *prefs)
 +			      struct list_head *prefs, u64 *total_refs)
  {
  	struct btrfs_delayed_extent_op *extent_op = head->extent_op;
  	struct rb_node *n = head->node.rb_node;
 @@ -596,6 +598,7 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
  		default:
  			BUG_ON(1);
  		}
 +		*total_refs += (node->ref_mod * sgn);
  		switch (node->type) {
  		case BTRFS_TREE_BLOCK_REF_KEY: {
  			struct btrfs_delayed_tree_ref *ref;
 @@ -656,7 +659,8 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
   */
  static int __add_inline_refs(struct btrfs_fs_info *fs_info,
 			     struct btrfs_path *path, u64 bytenr

Re: fresh btrfs filesystem, out of disk space, hundreds of gigs free

2014-03-22 Thread Hugo Mills
On Sat, Mar 22, 2014 at 06:21:02PM -0500, Jon Nelson wrote:
 Duncan 1i5t5.duncan at cox.net writes:
  Jon Nelson posted on Fri, 21 Mar 2014 19:00:51 -0500 as excerpted:
[snip]
   Below are the btrfs fi df /  and  btrfs fi show.
  
  
   turnip:~ # btrfs fi df /
   Data, single: total=1.80TiB, used=832.22GiB
   System, DUP: total=8.00MiB, used=204.00KiB
   System, single: total=4.00MiB, used=0.00
   Metadata, DUP: total=5.50GiB, used=5.00GiB
   Metadata, single: total=8.00MiB, used=0.00
 
  FWIW, the system and metadata single chunks reported there are an
  artifact from mkfs.btrfs and aren't used (used=0.00).  At some point it
  should be updated to remove them automatically, but meanwhile, a balance
  should remove them from the listing.  If you do that balance immediately
  after filesystem creation, at the first mount, you'll be rid of them when
  there's not a whole lot of other data on the filesystem to balance as
  well.  That would leave:
 
   Data, single: total=1.80TiB, used=832.22GiB
   System, DUP: total=8.00MiB, used=204.00KiB
   Metadata, DUP: total=5.50GiB, used=5.00GiB
 
  Metadata is the red-flag here.  Metadata chunks are 256 MiB in size, but
  in default DUP mode, two are allocated at once, thus 512 MiB at a time.
  And you're under 512 MiB free so you're running on the last pair of
  metadata chunks, which means depending on the operation, you may need to
  allocate metadata pretty quickly.  You can probably copy a few files
  before that, but a big copy operation with many files at a time would
  likely need to allocate more metadata.
 
 The size of the chunks allocated is especially useful information. I've not
 seen that anywhere else, and it does explain a fair bit.
 
  But for a complete picture you need the filesystem show output, below, as
  well...
 
   turnip:~ # btrfs fi show
   Label: none  uuid: 9379c138-b309-4556-8835-0f156b863d29
   Total devices 1 FS bytes used 837.22GiB
   devid1 size 1.81TiB used 1.81TiB path /dev/sda3
  
   Btrfs v3.12+20131125
 
  OK.  Here we see the root problem.  Size 1.81 TiB, used 1.81 TiB.  No
  unallocated space at all.  Whichever runs out of space first, data or
  metadata, you'll be stuck.
 
 Now it's at this point that I am unclear. I thought the above said:
 1 device on this filesystem, 837.22 GiB used.
 and
 device ID #1 is /dev/sda3, is 1.81TiB in size, and btrfs is using 1.81TiB
 of that
 
 Which I interpret differently. Can you go into more detail as to how (from
 btrfs fi show) we can say the _filesystem_ (not the device) is full?

   From btrfs fi show on its own, you can't. The problem is that the
data/metadata split means that the metadata has run out, and there's
(currently -- see below) no way of reassigning some of the data
allocation to metadata. So the disk full condition is complete
allocation (see btrfs fi show) *and* metadata near-full (see btrfs
fi df).
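
   To make that concrete, reading your own two outputs from above side
by side (the numbers below are just yours, not a general rule):

# btrfs fi show
    devid    1 size 1.81TiB used 1.81TiB path /dev/sda3   <- device fully allocated
# btrfs fi df /
    Data, single: total=1.80TiB, used=832.22GiB           <- lots of room *inside* data
    Metadata, DUP: total=5.50GiB, used=5.00GiB            <- ~0.5GiB headroom, nowhere to grow

   The first shows there's no unallocated space left to create new
chunks from; the second shows that metadata is the pool that's about to
run out first.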

   An interesting question here is how come the FS allocated all that
space to data when it's a newly-made filesystem with less than half
that space actually used -- did you write lots of other data to it and
then delete it again? If not, I haven't seen overallocation like that
since 3.9 or so, and it would be good to know what happened.

[snip]
  Meanwhile, I strongly urge you to read up on the btrfs wiki.  The
  following is easy to remember and bookmark:
 
 I read the wiki and related pages many times, but there is a lot of info
 there and I must have skipped over the "if your device is large" section.
 
 To be honest, it seems like a lot of hoop-jumping and a maintenance burden
 for the administrator. Not being able to draw from free space pool for
 either data or metadata seems like a big bummer. I'm hoping that such a
 limitation will be resolved at some near-term future point.

   It's certainly something that's been discussed in the past. I think
Ilya had automatic reclamation of unused allocation (e.g. an autonomic
balance / reallocation) on his to-do list at one point. I don't know
what the status of the work is, though.

[snip]

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Alert status chocolate viridian: Authorised personnel only. ---   
   Dogs must be carried on escalator.




Re: ERROR: error during balancing '.' - No space left on device

2014-03-23 Thread Hugo Mills
On Sun, Mar 23, 2014 at 12:01:44AM -0700, Marc MERLIN wrote:
 legolas:/mnt/btrfs_pool2# btrfs balance .
 ERROR: error during balancing '.' - No space left on device
 There may be more info in syslog - try dmesg | tail
 [ 8454.159635] BTRFS info (device dm-1): relocating block group 288329039872 
 flags 1
 [ 8590.167294] BTRFS info (device dm-1): relocating block group 232494465024 
 flags 1
 [ 9200.801177] BTRFS info (device dm-1): relocating block group 85928706048 
 flags 1
 [ 9533.830623] BTRFS info (device dm-1): 824 enospc errors during balance
 
 But:
 legolas:/mnt/btrfs_pool2# btrfs fi show `pwd`
 Label: btrfs_pool2  uuid: 6afd4707-876c-46d6-9de2-21c4085b7bed
   Total devices 1 FS bytes used 646.41GiB
   devid1 size 820.45GiB used 820.45GiB path /dev/mapper/disk2
 Btrfs v3.12
 legolas:/mnt/btrfs_pool2# 
 Data, single: total=800.42GiB, used=636.91GiB
 System, DUP: total=8.00MiB, used=92.00KiB
 System, single: total=4.00MiB, used=0.00
 Metadata, DUP: total=10.00GiB, used=9.50GiB

  ^^^ This is where you're full. There's a block reserve here that
  (should be) usable for doing a balance, and it seems to be at around
  500 MiB or so free that the problems start to show up.

 Metadata, single: total=8.00MiB, used=0.00
 
 I can't see how I'm full, and now that I can't run balance to fix
 things, this is making things worse.

   I think you probably shouldn't be doing a full balance, but a
filtered one:

# btrfs balance start -dusage=5 /mnt/btrfs_pool

which should only try to clean up chunks which have little usage (so
it's much faster to run).

 Kernel is 3.14.
 
 What am I missing?

   Not much. We do seem to have a problem with not being able to run
balance in recent kernels under some circumstances -- you're not the
only person who's reported this kind of problem lately.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Don't worry, he's not drunk. He's like that all the time. ---




Re: btrfs-tools missing btrfs device delete devid=x path ?

2014-03-23 Thread Hugo Mills
On Sun, Mar 23, 2014 at 08:25:17AM -0700, Marc MERLIN wrote:
 I'm still doing some testing so that I can write some howto.
 
 I got that far after a rebalance (mmmh, that took 2 days with little
 data, and unfortunately 5 deadlocks and reboots).
 
 polgara:/mnt/btrfs_backupcopy# btrfs fi show
 Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
 Total devices 11 FS bytes used 114.35GiB
 devid1 size 465.76GiB used 32.14GiB path /dev/dm-0
 devid2 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdd1
 devid3 size 465.75GiB used 0.00 path    <-- drive is freed up now.
 devid4 size 465.76GiB used 32.14GiB path /dev/dm-2
 devid5 size 465.76GiB used 32.14GiB path /dev/dm-3
 devid6 size 465.76GiB used 32.14GiB path /dev/dm-4
 devid7 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdi1
 devid8 size 465.76GiB used 32.14GiB path /dev/dm-6
 devid9 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdk1
 devid10 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdl1
 devid11 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sde1
 Btrfs v3.12
 
 What's the syntax for removing a drive that isn't there?

   btrfs dev del missing /path

   Removes all the missing devices.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Anyone using a computer to generate random numbers is, of ---
   course,  in a state of sin.   




Re: ERROR: error during balancing '.' - No space left on device

2014-03-23 Thread Hugo Mills
On Sun, Mar 23, 2014 at 09:20:00AM -0700, Marc MERLIN wrote:
 Both
 legolas:/mnt/btrfs_pool2# btrfs balance start -v -dusage=5 /mnt/btrfs_pool2
 legolas:/mnt/btrfs_pool2# btrfs balance start -v -dusage=0 /mnt/btrfs_pool2
 failed unfortunately.
 
 On Sun, Mar 23, 2014 at 12:26:32PM +, Duncan wrote:
  When it rains, it pours.  What you're missing is that this is now the 
  third thread in three days with exactly the same out-of-space-when-there-
  appears-to-be-plenty problem, which is well explained and a solution 
  presented, along with further discussion, on those threads.
  
  Evidently you haven't read the others, but rather than rewrite a similar 
  reply here with exactly the same explanation and fix, I'll just refer you 
  to them.
 
 Thanks. Indeed, while I spent most of yesterday dealing with 3 btrfs
 filesystems -- the one here that was hanging my laptop, the raid5 one that
 was hanging repeatedly during balance, and then my main server where one
 FS is so slow that it takes 8H to do a reflink copy or delete a backup
 with 1 million inodes -- I got behind on reading the list :)
 
 Thanks for the pointers
 
  btrfs balance start -dusage=5 `pwd`
  
  Tweak the N in usage=N as needed.
 
 I had actually tried this, but it failed too:
 legolas:/mnt/btrfs_pool2# btrfs balance start -v -dusage=5 /mnt/btrfs_pool2
 Dumping filters: flags 0x1, state 0x0, force is off
   DATA (flags 0x2): balancing, usage=5
 ERROR: error during balancing '/mnt/btrfs_pool2' - No space left on device
 
 But I now just found
 https://btrfs.wiki.kernel.org/index.php/Balance_Filters
 and tried -dusage=0
  
 On Sun, Mar 23, 2014 at 11:47:12AM +, Hugo Mills wrote:
 I think you probably shouldn't be doing a full balance, but a
  filtered one:
  
  # btrfs balance start -dusage=5 /mnt/btrfs_pool
  
  which should only try to clean up chunks which have little usage (so
  it's much faster to run).
 
 Thanks for the other answer Hugo.
 
 So, now I'm down to 
 legolas:/mnt/btrfs_pool2# btrfs balance start -v -dusage=0 /mnt/btrfs_pool2
 Dumping filters: flags 0x1, state 0x0, force is off
   DATA (flags 0x2): balancing, usage=0
 ERROR: error during balancing '/mnt/btrfs_pool2' - No space left on device
 
 Looks like there is no good way out of this, so I'll start deleting
 snapshots.

   Before you do this, can you take a btrfs-image of your metadata,
and add a report to bugzilla.kernel.org? You're not the only person
who's had this problem recently, and I suspect there's something
still lurking in there that needs attention.

 Hopefully this will be handled better in later code.

   With the info you can provide from a btrfs-image... let's hope so. :)

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Anyone using a computer to generate random numbers is, of ---
   course,  in a state of sin.   




Re: ERROR: error during balancing '.' - No space left on device

2014-03-23 Thread Hugo Mills
On Sun, Mar 23, 2014 at 10:03:14AM -0700, Marc MERLIN wrote:
 On Sun, Mar 23, 2014 at 04:28:25PM +, Hugo Mills wrote:
 Before you do this, can you take a btrfs-image of your metadata,
  and add a report to bugzilla.kernel.org? You're not the only person
  who's had this problem recently, and I suspect there's something
  still lurking in there that needs attention.
  
   Hopefully this will be handled better in later code.
  
 With the info you can provide from a btrfs-image... let's hope so. :)
 
 Mmmh, this may not be good :-/
 
 legolas:/mnt/btrfs_pool2# btrfs-image -c 9 -t 6 /dev/mapper/disk2 
 /tmp/pool2.image
 parent transid verify failed on 295965446144 wanted 51493 found 51495
 parent transid verify failed on 295965446144 wanted 51493 found 51495
 parent transid verify failed on 295965446144 wanted 51493 found 51495
 parent transid verify failed on 295965446144 wanted 51493 found 51495
 Ignoring transid failure
 leaf parent key incorrect 295965446144
 parent transid verify failed on 106205184 wanted 51468 found 51528
 parent transid verify failed on 106205184 wanted 51468 found 51528
 parent transid verify failed on 106205184 wanted 51468 found 51528
 parent transid verify failed on 106205184 wanted 51468 found 51528
[snip]
 Segmentation fault
 
 Is it a bug in the tool, or do I have real corruption?
 
 Are there magic options I can give it to make it work around this?

   xaba on IRC has just pointed out that it looks like you're running
this on a mounted filesystem -- it needs to be unmounted for
btrfs-image to work reliably.
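
   Roughly, and assuming you can take the filesystem offline for a
moment (device and paths below are just the ones from your own command):

# umount /mnt/btrfs_pool2
# btrfs-image -c 9 -t 6 /dev/mapper/disk2 /tmp/pool2.image

   If it's a root filesystem you can't unmount, running the same thing
from a rescue/live environment amounts to the same.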

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Unix: For controlling fungal diseases in crops. --- 




Re: for Chris Mason ( iowatcher graphs)

2014-03-23 Thread Hugo Mills
On Sun, Mar 23, 2014 at 09:36:19PM +0400, Vasiliy Tolstov wrote:
 Hello. Sorry for writing to btrfs mailing list, but personal mail
 reject my message.
 Saying 
 chris.ma...@fusionio.com: host 10.101.1.19[10.101.1.19] said: 554 5.4.6 Hop
 count exceeded - possible mail loop (in reply to end of DATA command)

   He's moved to Facebook now.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Unix: For controlling fungal diseases in crops. --- 




Re: Any use for mkfs.btrfs -d raid5 -m raid1 ?

2014-03-23 Thread Hugo Mills
On Sun, Mar 23, 2014 at 03:44:35PM -0700, Marc MERLIN wrote:
 If I lose 2 drives on a raid5, -m raid1 should ensure I haven't lost my
 metadata.
 From there, would I indeed have small files that would be stored entirely on
 some of the drives that didn't go missing, and therefore I could recover
 some data with 2 missing drives?

   btrfs's RAID-1 is two copies only, so you may well have lost some
of your metadata. n-copies RAID-1 is coming Real Soon Now™ (Chris has
it on his todo list, along with fixing all the parity RAID stuff).

 Or is it kind of pointless/waste of space?
 
 Actually, would it make btrfs faster for metadata work since it can read
 from n drives in parallel and get data just a bit faster, or is that mostly
 negligible?

   I don't think we've got good benchmarks from anyone on any of this
kind of thing.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Great oxymorons of the world, no. 9: Standard Deviation --- 




Re: [PATCH v2] btrfs-progs: allow use of subvolume id to create snapshots

2014-03-25 Thread Hugo Mills
 created snapshot will be readonly.
 +.IP \fB-i\fP \fIqgroupid\fR 5
 +Add the newly created subvolume to a qgroup. This option can be given multiple
 +times.
 +.RE
 +.TP
 +
 +\fBsubvolume snapshot\fP [-r] [-i qgroupid] \fI-s subvolid\fP \fIdest\fP/\fIname\fP
 +Create a writable/readonly snapshot of the subvolume \fIsubvolid\fR with the
 +name \fIname\fR in the \fIdest\fR directory.
 +If \fIsubvolid\fR does not refer to a subvolume, \fBbtrfs\fR returns an error.
 +.RS
 +
 +\fIOptions\fP
 +.IP \fB-r\fP 5
 +The newly created snapshot will be readonly.
 +.IP \fB-i\fP \fIqgroupid\fR 5
 +Add the newly created subvolume to a qgroup. This option can be given multiple
 +times.
 +.RE
  .TP
  
  \fBsubvolume get-default\fR\fI path\fR

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- vi: The core of evil. ---  




Re: free space inode generation (0) did not match free space cache generation

2014-03-25 Thread Hugo Mills
On Tue, Mar 25, 2014 at 09:03:26PM +0100, Hendrik Friedel wrote:
 Hi,
 
 Well, given the relative immaturity of btrfs as a filesystem at this
 point in its lifetime, I think it's acceptable/tolerable.  However, for a
 filesystem feted[1] to ultimately replace the ext* series as an assumed
 Linux default, I'd definitely argue that the current situation should be
 changed such that btrfs can automatically manage its own de-allocation at
 some point, yes, and that said some point really needs to come before
 that point at which btrfs can be considered an appropriate replacement
 for ext2/3/4 as the assumed default Linux filesystem of the day.
 
 Agreed! I hope, this is on the ToDo List?!

https://btrfs.wiki.kernel.org/index.php/Project_ideas#Block_group_reclaim

   Yes. :)

 [1] feted: celebrated, honored.  I had to look it up to be sure my
 intuition on usage was correct, and indeed I had spelled it wrong
 
 :-)

   Did you mean fated: intended, destined?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- IMPROVE YOUR ORGANISMS!!  -- Subject line of spam email --- 




Re: free space inode generation (0) did not match free space cache generation

2014-03-25 Thread Hugo Mills
On Tue, Mar 25, 2014 at 09:28:20PM +, Duncan wrote:
 Hugo Mills posted on Tue, 25 Mar 2014 20:10:20 + as excerpted:
 
  Did you mean fated: intended, destined?
 
 No, I meant feted, altho I understand in Europe the first e would 
 likely have a caret-hat (fêted), but us US-ASCII folks don't have such a 
 thing easily available, so unless I copy/paste as I just did or use 
 charselect, feted without the caret it is.

   Either word works in the context -- I wasn't knocking you at all. I
was just testing the fit of the homophone (particularly since you'd
mentioned checking the spelling).

 Where I've seen feted used it tends to have a slightly future-
 predictive hint to it, something that's considered a shoe-in to use 

   Or a shoo-in... :)

 another term, but that isn't necessarily certain just yet.  Alternatively 
 or as well, it can mean something that many or the majority considers/
 celebrates as true, but that the author isn't necessarily taking a 
 particular position on at this time, perhaps as part of the traditional 
 journalist's neutral observer's perspective, saying other people 
 celebrate it as, without personally 100% endorsing the same position.
 
 Which fit my usage exactly.  I wanted to indicate that btrfs' position as 
 a successor to the ext3/4 throne is a widely held expectation, but that 
 while I agree with the general sentiment, it's with a wait and see if/
 when these few details get fixed attitude, because I don't think that a 
 btrfs that a knowledgeable admin must babysit in order to be sure it 
 doesn't run out of unallocated chunks, for example, is quite ready for 
 usage by the masses, that is, to take the throne as crowned successor 
 to ext3/4 just yet.  And feted seemed the perfect word to express and 
 acknowledge that expectation, while at the same time conveying my slight 
 personal reservation.

   Ack. There's a number of sharp edges like this hanging around.
Those of us who've been here for a while don't tend to notice them (or
at least, deprioritise them), and it's a good thing to have people
saying "do I really have to do this crap?" occasionally.

   Hugo.

 In fact, until I looked up the word I had no idea the word could also be 
 used as a noun in addition to my usage as a verb, and used as a noun, 
 that it meant a feast, celebration or carnival.  I was familiar only with 
 the usage I demonstrated here, including the slight hint of third party 
 neutrality or wait-and-see reservation, which was in fact my reason for 
 choosing the term in the first place.
 
 (This is of course one reason I so enjoy newsgroups and mailing lists.  
 One never knows what sort of entirely unpredicted but useful thing one 
 might learn from them, even in my own replies sometimes! =:^)


-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Great oxymorons of the world, no. 10: Business Ethics ---  




Re: Inappropriate ioctl

2014-03-26 Thread Hugo Mills
On Wed, Mar 26, 2014 at 10:05:19PM +0100, Johannes Stemmler wrote:
 Sorry for reporting this issue to you, but I have not found any helpful
 information elsewhere.
 
 I have installed a new opensuse-system 12.3 from scratch and selected an
 ext4-fs for /
 and btrfs for /home
 The installation works and the access to the /home btrfs works also.
 
 But the btrfs-progs doesn't.
 
 show works
 # btrfs filesystem show /dev/sda8
 Label: lx_home  uuid: 5a6857e3-b421-4a68-92f4-0c7e5f9fcb4c
 Total devices 1 FS bytes used 74.40GiB
 devid1 size 262.63GiB used 77.04GiB path /dev/sda8
 
 df does not work
 # btrfs filesystem df /dev/sda8

   ^^^ this needs to be a location of the *mounted* filesystem, not a
device. Most of the btrfs functions do this, although not all of them
-- check the man pages, or the online help, which should state either
device or mountpoint.
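
   For example, given the mount output below, something like this
should work:

# btrfs filesystem df /home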

   Hugo.

 ERROR: couldn't get space info - Inappropriate ioctl for device
 ERROR: get_df failed Inappropriate ioctl for device
 
 #mount
 /dev/sda6 on / type ext4 (rw,relatime,data=ordered)
 /dev/sda8 on /home type btrfs (rw,relatime,ssd,space_cache)
 
 A reinstallation of the btrfs-progs did not help.
 Is the system badly packaged by opensuse or is my constellation forbidden?
 
 Best regards,
 Johannes Stemmler
 
 p.s. the btrfs is on a ssd-partition

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Well, you don't get to be a kernel hacker simply by looking ---   
good in Speedos. -- Rusty Russell




Re: RHEL/CentOS or Debian for stable deployment

2014-03-28 Thread Hugo Mills
On Fri, Mar 28, 2014 at 04:38:09PM -0700, Lists wrote:
 On 03/28/2014 02:42 PM, Avi Miller wrote:
 Have you considered Oracle Linux? We are continually backporting btrfs fixes 
 and enhancements to our Unbreakable Enterprise Kernel releases. On Oracle 
 Linux 6, you would run the UEK Release 3, which is based on 3.8 mainline 
 with upstream fixes. We also provide all security and bug fix errata for 
 free via http://public-yum.oracle.com, so you don’t need to buy support to 
 run Oracle Linux and keep up-to-date.
 
 Can't remember asking if btrfs is supported on 32 bit kernels?

   I can't speak for Oracle's distribution, but in the general case,
yes, btrfs works on 32 bit systems.

   On the subject of bitness, there's a couple of rough edges with
64-bit kernels and 32-bit userspace. (I think there's still one ioctl
that fails on that configuration, but I've not hit it yet on my test
machine).

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- My doctor tells me that I have a malformed public-duty gland, ---  
and a natural deficiency in moral fibre. 




Re: RHEL/CentOS or Debian for stable deployment

2014-03-29 Thread Hugo Mills
On Sat, Mar 29, 2014 at 05:18:25PM -0700, Marc MERLIN wrote:
 On Fri, Mar 28, 2014 at 11:45:03PM +, Hugo Mills wrote:
  On Fri, Mar 28, 2014 at 04:38:09PM -0700, Lists wrote:
   On 03/28/2014 02:42 PM, Avi Miller wrote:
    Have you considered Oracle Linux? We are continually backporting btrfs 
    fixes and enhancements to our Unbreakable Enterprise Kernel releases. On 
    Oracle Linux 6, you would run the UEK Release 3, which is based on 3.8 
    mainline with upstream fixes. We also provide all security and bug fix 
    errata for free via http://public-yum.oracle.com, so you don’t need to 
    buy support to run Oracle Linux and keep up-to-date.
   
   Can't remember asking if btrfs is supported on 32 bit kernels?
  
 I can't speak for Oracle's distribution, but in the general case,
  yes, btrfs works on 32 bit systems.
  
 On the subject of bitness, there's a couple of rough edges with
  64-bit kernels and 32-bit userspace. (I think there's still one ioctl
  that fails on that configuration, but I've not hit it yet on my test
  machine).
 
 btrfs send does not work with 32bit userland and 64bit kernel when I
 last tried it a few weeks ago.

 Thankfully I had debian, so I was able to upgrade just btrfs-tools to
 64bit without upgrading my entire system, and that solved it.

   That's the one I know about and patched some weeks ago :) (it
should be in btrfs-next by now).

   Hugo.


-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- What do you give the man who has everything? -- Penicillin is ---  
 a good start... 




Re: btrfs send/receive still gets out of sync in 3.14.0

2014-03-30 Thread Hugo Mills
On Sat, Mar 29, 2014 at 08:22:02PM -0700, Marc MERLIN wrote:
 On Sat, Mar 22, 2014 at 02:04:56PM -0700, Marc MERLIN wrote:
  After deleting a huge directory tree in my /home subvolume, syncing
  snapshots now fails with:
  
  ERROR: rmdir o1952777-157-0 failed. No such file or directory
 
 So, I'm ok again after I deleted my destination snapshot and re-init'ed,
 but on multi terabyte backups, this ain't great :)
 
 Do I need to file a bug that btrfs send/receive still gets out of sync
 in 3.14, or is it already known and maybe even fixed in btrfs-next?

   Filipe has been posting a series of patches related to send/receive
recently, so this may be related to those bugs.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Gort!  Klaatu barada nikto! ---   




Re: BTRFS setup advice for laptop performance ?

2014-04-04 Thread Hugo Mills
On Fri, Apr 04, 2014 at 10:02:27AM +0200, Swâmi Petaramesh wrote:
 Hi,
 
 I'm going to receive a new small laptop with a 500 GB 5400 RPM mechanical 
 ole' rust HD, and I plan to install BTRFS on it.
 
 It will have a kernel 3.13 for now, until 3.14 gets released.
 
 However I'm still concerned with chronic dreadful BTRFS performance, and I still 
 find that BTRFS degrades much over time even with periodic defrag and best 
 practices etc.

   There's something funny going on here. There are, apparently, a
reasonable number of people using btrfs in daily use, with things like
snapper (regular and frequent snapshots). I'm one of them, although I
don't use snapper. We don't have lots of reports of massive slowdowns
after a long period of use, so whatever you're doing, there seems to
be something unusual involved.

   It's almost certainly not your fault, but there would appear to be
something in your configuration or your use-case which is leading to
these problems, and without knowing what's different, it's hard to set
about identifying the problem.

   What software do you run on the machine? Browser? Any databases?
Anything that contains a database? Torrents or other filesharing
software? Bitcoin mining? Bitcoin wallet? Anything else beyond the
ordinary boring desktop/office type applications? Are you compiling
lots of things (e.g. Gentoo)? Creating and deleting lots of files? If
so, large ones or small ones? Are you running very close to a full
filesystem? How are you measuring the slowdown -- do you have a
specific piece of benchmarking software, or just anecdotal evidence?

 So I'd like to start with the best possible options and have a few questions :
 
 - Is it still recommended to mkfs with a nodesize or leafsize different 
 (bigger) than the default ? I wouldn't like to lose too much disk space 
 anyway 
 (1/2 nodesize per file on average ?), as it will be limited...

   No, nodes are used for the metadata trees, not for file storage.
I'd suggest nodesize=leafsize=16k or 32k. I don't think you can change
the block size at the moment.
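
   For example (purely illustrative -- /dev/sdXn is a placeholder, and
the option names should be checked against mkfs.btrfs(8) for your progs
version):

# mkfs.btrfs -l 16384 -n 16384 /dev/sdXn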

 - Is it recommended to alter the FS to have skinny extents ? I've
 done this on all of my BTRFS machines without problem, still the
 kernel spits a notice at mount time, and I'm worrying kind of Why
 is the kernel warning me I have skinny extents ? Is it bad ? Is it
 something I should avoid ?

   As far as I know, they're considered safe and stable. I suspect
that the message is just a developer info thing that hasn't been taken
out yet.

 - Are there other optimization tricks I should perform at mkfs time because 
 thay can't be changed later on ?

   Nodesize/leafsize are the only things you should probably change at
mkfs time. The other thing would be --mixed, but you probably don't
want that on a 500 GiB drive.

 - Are there other btrfstune or mount options I should pass before
 starting to populate the FS with a system and data ?

   I think everything else other than the above can be done after the
fact with btrfstune. I'd definitely suggest extended inode refs simply
because it fixes a known limitation.
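
   Concretely, something along these lines, run against the unmounted
device (/dev/sdXn is a placeholder, and do double-check btrfstune's
usage output for your progs version rather than trusting my flag names):

# btrfstune -r /dev/sdXn    # extended inode refs
# btrfstune -x /dev/sdXn    # skinny metadata extents, as discussed above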

 - Generally speaking, does LZO compression improve or degrade performance ? 
 I'm not able to figure it out clearly.

   Yes, it improves or degrades performance. :)

   It'll depend entirely on what you're doing with it. If you're
storing lots of zeroes (Phoronix, I'm looking at you), then you'll get
huge speedups. If you're storing video data, you'll get a (very)
 slight performance drop as it compresses the first few blocks of the
 file and then gives up. I suspect that in general, the performance
 differences won't be noticeable unless you have highly compressible
large files, but if you _really_ care about it, benchmark it(*).

   Hugo.

(*) If you don't want to go through the effort of benchmarking, you
don't care enough about it, and should just pick something at random.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- And what rough beast,  its hour come round at last / slouches ---  
 towards Bethlehem,  to be born? 




Re: BTRFS send/receive limitations

2014-04-04 Thread Hugo Mills
On Fri, Apr 04, 2014 at 09:50:05AM -0700, Lists wrote:
 I read recently that you can't send/receive concurrent streams on the same
 filesystem, which begs the question of what is meant by a filesystem. Is
 that to say that you can't send/receive snapshots on different subvolumes to
 the same root filesystem? Or that you can't send/receive multiple
 snapshots on the same subvolume? Can you send/receive a snapshot or
 subvolume to the same root filesystem?

   The restriction was on the same *filesystem* as a whole: there was
a global lock on the whole FS, which could cause deadlocks with send
and receive both accessing the same FS (any subvolumes). I don't
recall hearing about problems with two sends from different subvols on
the same FS, but that might just be because I wasn't paying attention.
:)

   I think those restrictions are gone now, in some patch in the
pipeline. Possibly for 3.15 -- I'm not sure if the patches made it
into 3.14.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Gomez, darling, don't torture yourself.  That's my job. --- 




Re: BTRFS setup advice for laptop performance ?

2014-04-05 Thread Hugo Mills
On Sat, Apr 05, 2014 at 01:10:13PM +0200, Swâmi Petaramesh wrote:
 Le samedi 5 avril 2014 10:12:17 Duncan wrote [excellent performance advice 
 about disabling Akonadi in BTRFS etc]:
 
 Thanks Duncan for all this excellent discussion.
 
 However I'm still rather puzzled with a filesystem for which the advice is "if 
 you want tolerable performance, you have to turn off features that are the 
 default with any other FS out there" (relatime -> noatime), or "you have to 
 quit using this database", or "you have to fiddle around with esoteric options 
 such as disabling COW", which BTW is one of BTRFS's most prominent features.

   OK, a couple of points here:

 - For the things where you should turn CoW off, they're typically
   things like databases that do their _own_ CoW handling or similar
   high-performance reliable transactions/writes, so you generally
   lose very little there.

 - I'm not aware, particularly, of any major differences between
   noatime and relatime in performance on btrfs. (But I may be wrong
   there).

 - Given Duncan's discussion of the performance of the semantic
   desktop, I would suggest turning it off *temporarily* to see if it
   really is where the difficulty lies. If it turns out that it's
   unrelated and things still slow down horribly, then at least we've
   knocked down one theory and need to look elsewhere. If it _is_
   related, then that at least gives us a reproducer for the problem,
   and the people who are skilled in tracking down performance
   problems have something to look at. It also means that you have a
   range of things you know you can try if the problem gets really bad
   (maybe delete the database and rebuild it regularly? mark parts of
   it nodatacow? maybe autodefrag helps? maybe it's something simple
   the authors of the database can change?).

[snip]
 I need a filesystem that fits me, I don't want to have to fit my filesystem 
 :-\

   If you truly find btrfs unusable -- which you've said at various
points in the past -- then I'm not going to suggest that you keep
using it. Maybe something else is genuinely better for you. It's not
in the interests of the btrfs community to recommend that people use
btrfs when it's not appropriate. That said, it would be good to have
your help to try to fix the (apparently quite unusual) problems you're
seeing. Part of that is tracking down which bit of software is
triggering the issues, and helping to identify what the issues
actually are. Sometimes, the hardest problem in fixing bugs is finding
someone who can reproduce the bug and test fixes.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Ceci n'est pas une pipe:  | ---   




Re: [PATCH 24/27] btrfs-progs: Convert man page for btrfs-zero-log

2014-04-05 Thread Hugo Mills
On Sat, Apr 05, 2014 at 04:00:27PM -0600, cwillu wrote:
 On Fri, Apr 4, 2014 at 12:46 PM, Marc MERLIN m...@merlins.org wrote:
  On Wed, Apr 02, 2014 at 04:29:35PM +0800, Qu Wenruo wrote:
  Convert man page for btrfs-zero-log
 
  Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
  ---
   Documentation/Makefile   |  2 +-
   Documentation/btrfs-zero-log.txt | 39 
  +++
   2 files changed, 40 insertions(+), 1 deletion(-)
   create mode 100644 Documentation/btrfs-zero-log.txt
 
  diff --git a/Documentation/Makefile b/Documentation/Makefile
  index e002d53..de06629 100644
  --- a/Documentation/Makefile
  +++ b/Documentation/Makefile
  @@ -11,7 +11,7 @@ MAN8_TXT += btrfs-image.txt
   MAN8_TXT += btrfs-map-logical.txt
   MAN8_TXT += btrfs-show-super.txt
   MAN8_TXT += btrfstune.txt
  -#MAN8_TXT += btrfs-zero-log.txt
  +MAN8_TXT += btrfs-zero-log.txt
   #MAN8_TXT += fsck.btrfs.txt
   #MAN8_TXT += mkfs.btrfs.txt
 
  diff --git a/Documentation/btrfs-zero-log.txt 
  b/Documentation/btrfs-zero-log.txt
  new file mode 100644
  index 000..e3041fa
  --- /dev/null
  +++ b/Documentation/btrfs-zero-log.txt
  @@ -0,0 +1,39 @@
  +btrfs-zero-log(8)
  +=
  +
  +NAME
  +
  +btrfs-zero-log - clear out log tree
  +
  +SYNOPSIS
  +
  +'btrfs-zero-log' dev
  +
  +DESCRIPTION
  +---
  +'btrfs-zero-log' will remove the log tree if log tree is corrupt, which 
  will
  +allow you to mount the filesystem again.
  +
  +The common case where this happens has been fixed a long time ago,
  +so it is unlikely that you will see this particular problem.
 
  A note on this one: this can happen if your SSD writes things in the
  wrong order or potentially writes garbage when power is lost, or before
  locking up.
  I hit this problem about 10 times and it wasn't a btrfs bug, just the
  drive doing bad things.
 
 And -o recovery didn't work around it?  My understanding is that -o
 recovery will skip reading the log.

   No, I'm pretty sure we've had people with problems with the log
where -orecovery didn't help, but -oro,recovery allowed it to be
mounted, because -ro didn't try to replay the log.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- If the first-ever performance is the première,  is the --- 
  last-ever performance the derrière?   




Re: [PATCH 24/27] btrfs-progs: Convert man page for btrfs-zero-log

2014-04-05 Thread Hugo Mills
On Sat, Apr 05, 2014 at 03:02:03PM -0700, Marc MERLIN wrote:
 On Sat, Apr 05, 2014 at 04:00:27PM -0600, cwillu wrote:
   +'btrfs-zero-log' will remove the log tree if log tree is corrupt, which 
   will
   +allow you to mount the filesystem again.
   +
   +The common case where this happens has been fixed a long time ago,
   +so it is unlikely that you will see this particular problem.
  
   A note on this one: this can happen if your SSD writes things in the
   wrong order or potentially writes garbage when power is lost, or before
   locking up.
   I hit this problem about 10 times and it wasn't a btrfs bug, just the
   drive doing bad things.
  
  And -o recovery didn't work around it?  My understanding is that -o
  recovery will skip reading the log.
 
 Maybe it does, but if you're trying to mount your root filesystem to boot
 your laptop, that's not super useful since -o recovery is indeed a read only
 recovery mode.
 btrfs-zero-log just cleans the last log entry and gave me back a fully working
 read/write filesystem each time.

   As far as I recall, -orecovery is read-write. -oro,recovery is
read-only.
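
   That is, something like (device and mountpoint are placeholders):

# mount -o recovery /dev/sdXn /mnt      # read-write, attempts recovery
# mount -o ro,recovery /dev/sdXn /mnt   # read-only, doesn't try to replay the log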

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Dullest spy film ever: The Eastbourne Ultimatum --- 




Re: Scrub bug on kernel 2.13

2014-04-07 Thread Hugo Mills
On Mon, Apr 07, 2014 at 10:32:04AM +0200, Swâmi Petaramesh wrote:
 Hi there,
 
 Machine got rebooted while scrub was in process, and now it looks like a 
 scrub 
 zombie...
 
 How do I restore this to a normal non-zombie state ?

   There's a status file in /var/lib/btrfs (at least, it's somewhere
near there -- I think that's it, though). Delete that, and you should
be OK. It's a known bug, and I'm fairly sure it was fixed in the
userspace tools some time ago. What version of the tools are you
using?
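
   If I've remembered the name format right, the file is keyed on the
filesystem UUID, so on your machine it should look something like:

# ls /var/lib/btrfs/
scrub.status.13c87f57-3a85-4daf-a4bf-ba777407c169
# rm /var/lib/btrfs/scrub.status.13c87f57-3a85-4daf-a4bf-ba777407c169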

   Hugo.

 root@zafu:~# btrfs scrub status /
 scrub status for 13c87f57-3a85-4daf-a4bf-ba777407c169
 scrub started at Mon Apr  7 09:49:48 2014, running for 693 seconds
 total bytes scrubbed: 34.06GiB with 0 errors
 
 root@zafu:~# btrfs scrub cancel /
 ERROR: scrub cancel failed on /: not running
 
 root@zafu:~# btrfs scrub start /
 ERROR: scrub is already running.
 To cancel use 'btrfs scrub cancel /'.
 To see the status use 'btrfs scrub status [-d] /'.
 
 root@zafu:~# btrfs scrub status /
 scrub status for 13c87f57-3a85-4daf-a4bf-ba777407c169
 scrub started at Mon Apr  7 09:49:48 2014, running for 693 seconds
 total bytes scrubbed: 34.06GiB with 0 errors
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Questions are a burden, and answers a prison for oneself. ---




Re: Using noCow with snapshots ?

2014-04-09 Thread Hugo Mills
On Wed, Apr 09, 2014 at 01:15:24PM +0200, Swâmi Petaramesh wrote:
 Hi,
 
 In the quest for BTRFS and performance, and having received the advice to 
 chattr +C my akonadi DB directory to make it noCow, I would like to be sure 
 about what will happen when I take a snapshot of the concerned BTRFS 
 subvolume.
 
 1/ Being noCow, will the database be modified in the snapshot as well, 
 effectively defeating the snapshot ?

   No (see below)

 2/ Being snapshotted, will the database be COWed even though it's
 supposed to be noCow ?

   Yes -- once.

   When you make a snapshot of a nodatacow file, the data is shared
between the snapshot and the original as normal. The extents are
reference counted, so the original data now has two references to it.

   When one of these copies is written to, the writes are placed
somewhere else on the disk, still marked as nodatacow, and the
reference count is reduced to 1 for each copy again. (Note that this
is done on a per-block basis, although the 30-second transaction
commit will tend to coalesce adjacent blocks to reduce fragmentation;
autodefrag helps here, too).

   Basically, a snapshot of a nodatacow file will increase the
reference count for its blocks. A write to a block with a reference
count of more than one will *always* write a new block elsewhere. A
write to a block with a reference count of exactly one will not do so
if the file is marked nodatacow. I hope that's clear.
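
   A quick way to watch this happen (paths are invented, and the dd
invocations are only there to generate block-sized writes):

# mkdir /mnt/db && chattr +C /mnt/db    # new files in here will be nodatacow
# dd if=/dev/zero of=/mnt/db/test bs=1M count=64
# btrfs sub snap /mnt /mnt/snap         # blocks now shared, refcount 2
# dd if=/dev/zero of=/mnt/db/test bs=4K count=1 seek=100 conv=notrunc

   The last write goes to a newly-allocated block (the one-off CoW);
repeat the same command afterwards and the second write modifies that
block in place again.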

 3/ Are both options mutually incompatible in some more obscure ways ?

   Only as noted above.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- There are three mistaikes in this sentance. ---   




Re: Filesystem unable to recover from ENOSPC

2014-04-10 Thread Hugo Mills
On Thu, Apr 10, 2014 at 01:00:35PM -0700, Chip Turner wrote:
 I have a filesystem that I can't seem to resolve ENOSPC issues.  No
 write operation can succeed; I've tried the wiki's suggestions
 (balancing, which fails because of ENOSPC, mounting with nodatacow,
 clear_cache, nospace_cache, enospc_debug, truncating files, deleting
 files, briefly microwaving the drive, etc).
 
 btrfs show:
 Label: none  uuid: 04283a32-b388-480b-9949-686675fad7df
 Total devices 1 FS bytes used 135.58GiB
 devid1 size 238.22GiB used 238.22GiB path /dev/sdb2
 
 btrfs fi df:
 Data, single: total=234.21GiB, used=131.82GiB
 System, single: total=4.00MiB, used=48.00KiB
 Metadata, single: total=4.01GiB, used=3.76GiB

   I'm surprised it's managed to use that much of the metadata
allocation. The FS usually hits this problem with a much smaller
used-to-total ratio.

 So, the filesystem is pretty much unusable, and I can find no way to
 resuscitate it.  I ended up in this state by creating a snapshot of
 the root of the fs into a read/write subvolume, which I wanted to
 become my new root, then began deleting entries in the filesystem
 itself outside of the new snapshot.  So nothing particularly weird or
 crazy.  The only oddness is the file count -- I have a *lot* of
 hardlinked files (this is an rsnapshot volume, so it has a large
 number of files and many of them are hard linked).
 
 It seems like the normal solution is btrfs balance, but that fails.
 defragment also fails.  Kernel is 3.13.
 
 Is there anything else I can or should do, or just wipe it and
 recreate with perhaps better initial defaults?  If this kind of thing
 is unavoidable, how might I have anticipated it and prevented it?
 Fortunately this was a migration effort and so my original data is
 safe inside of ext4 (ew).

   One thing you could do is btrfs dev add a small new device to the
filesystem (say, a USB stick, or a 4 GiB loopback file mounted over
NBD or something). Then run the filtered balance. Then btrfs dev del
the spare device.
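
   Sketched out, with sizes, paths and device names as placeholders
only:

# truncate -s 4G /tmp/spare.img         # on some *other* filesystem with space
# losetup /dev/loop0 /tmp/spare.img
# btrfs dev add /dev/loop0 /mnt/point
# btrfs balance start -dusage=5 /mnt/point
# btrfs dev del /dev/loop0 /mnt/point
# losetup -d /dev/loop0

   The extra device just gives the allocator somewhere to put a new
chunk so that the balance can make progress; once the balance has freed
up some unallocated space on the real device, the temporary one can go
away again.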

   The fact that this FS has ended up in this state should be
considered a bug. It used to be quite common, then josef fixed a load
of problems, and it's been rare for about a year. Only recently we've
been seeing more of this kind of problem, and I think there's been a
bit of a regression somewhere.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Can I offer you anything? Tea? Seedcake? ---
 Glass of Amontillado?  




Re: How to make BTRFS crawl

2014-04-11 Thread Hugo Mills
On Fri, Apr 11, 2014 at 07:11:52AM -0700, George Mitchell wrote:
 Well, Akonadi brought my system to its knees long before I converted to
 btrfs, so somehow I am not surprised.  I have kept akonadi disabled ever
 since.for everything except a portion of Thunderbird and that ONLY with
 sql-lite. Mysql will kill it in no time.  So I am not sure that btrfs is the
 root of the problem here.  Just my two cents, perhaps others have different
 experience with akonadi.

   My opinion (and it is purely opinion, since I haven't used any part
of KDE since the last millennium) is that akonadi probably isn't
massively efficient, *and* that it happens to hit a particular write
pattern that btrfs isn't handling too well. So I don't think it's fair
to point the blame solely at one or the other, but at the interaction
between bad (or awkward) behaviours of the two together.

   I'm surprised that it's showing very poor performance with the SSD,
though -- I'd have thought most of the performance loss would be in
additional seeks from the very fragmented file. Although with lots of
snapshots (e.g. snapper) going on, the benefits of reduced
fragmentation from the nodatacow are largely going to be lost because
each snapshot forces another round of CoWing and fragmentation.

   Hugo.

 On 04/11/2014 02:42 AM, Swâmi Petaramesh wrote:
 Hi,
 
 I was asked about situations use cases that would cause BTRFS to slow down
 to a crawl.
 
 And it's exactly what happened to me yesterday when I was trying, on the
 contrary, to speed it up.
 
 So here's the recipe for getting a slow to the point it is unusable BTRFS.
 
 
 1/ Perform a clean, fresh install of a recent distro with a 3.13 kernel (i.e.
 Fedora 20) and a BTRFS root filesystem.
 
 2/ Choose the version with a KDE interface
 
 3/ Configure fstab mountpoints using such options (space_cache will have been
 manually activated once):
 
 / btrfs   subvol=FEDORA,noatime,compress=lzo,autodefrag
 
 /home btrfs   subvol=HOME,noatime,compress=lzo,autodefrag
 
 
 4/ Use chattr +C to make the following directories NOCOW (move the old
 directory elsewhere, create a new dir, make it nocow, copy files from the old
 one so they are recreated with nocow, check permissions...):
 
 - /home/yourself/.cache
 - /home/yourself/.local/share/akonadi
 
 5/ Use IMAP mail in Kmail. Seriously process your email (it will be stored
 using akonadi mysql)
 
 6/ Surf normally the web using Firefox
 
 7/ Install SuSE snapper package that will perform a FS snapshot every hour.
 Configure it so it will snapshot both the root FS subvol and the /home subvol
 
 8/ Use the system for 24 hours and you will know what "hardly usable"
 means...
 Especially every hour-on-the-hour when Kmail or Firefox will try to access
 files that have been recently snapshotted... Your system will be dead with
 saturated HD access for several *minutes*
 
 ...Hope this may help hunting this down...
 
 Kind regards.
 
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Geek, n.: Circus sideshow performer specialising in the --- 
 eating of live animals. 




Re: Subvolumes and isolation

2014-04-14 Thread Hugo Mills
On Mon, Apr 14, 2014 at 10:38:45AM +, Holger Hoffstätte wrote:
 
 So I'm happily using subvolumes and snapshots and was wondering about
 subvolume low-level isolation. Assuming metadata=single, would a corrupt
 metadata block in one subvolume's directory tree affect any other subvolumes
 on the same physical partition, or would the fallout from this bad block be
 contained?

   With snapshots, potentially the FS trees can be shared as well
(that's what a snapshot is -- it's a CoW copy of the FS tree of a
subvol), so a corrupt block in the FS tree could be shared between the
subvols.

   With separately-created subvolumes (btrfs sub crea), the FS trees
will be independent from each other, but they will still share the
same extent tree (and all the other trees).

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- You've read the project plan.  Forget that. We're going to Do ---  
  Stuff and Have Fun doing it.   


signature.asc
Description: Digital signature


Re: Can I convert an existing directory into a subvolume?

2014-04-15 Thread Hugo Mills
On Tue, Apr 15, 2014 at 02:10:54PM +0100, Bob Williams wrote:
 Hi,
 
 I'm new to btrfs, just dipping my toes in the water...
 
 I've got two partitions, / on /dev/sda2 and /home on /dev/sda3, both
 formatted as btrfs in a new openSUSE 13.1 installation. I copied the
 whole of /home (4 users) into the btrfs formatted /home partition from
 an ext4 backup.
 
 I would like to create snapshots of /home/user/Documents for example,
 but I understand these have to be subvolumes first. Googling tells me I
 can't convert a conventional subdirectory into a subvolume, so I'm
 guessing I'll have to create a new /home/user/Documents subvolume and
 then copy all the contents from the subdirectory. Correct? Then delete
 the subdirectory?

   That's one way. You can refine the "copy all the contents" step by
using cp --reflink=always, which will make reflink (CoW) copies of the
data; that's vastly faster than an ordinary copy, as long as you're
not trying to take the data across a mount point.
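
   An untested sketch of that route (the names are examples only; check
permissions on the result before deleting anything):

# btrfs sub create /home/user/Documents.new
# cp -a --reflink=always /home/user/Documents/. /home/user/Documents.new/
# mv /home/user/Documents /home/user/Documents.old
# mv /home/user/Documents.new /home/user/Documents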

   Another way is to make a snapshot of the subvolume containing the
thing you want to convert, and then delete the pieces you don't want
(possibly rearranging the contents of the new subvol in the process).
So, assuming you have your original subvol mounted on /home, and you
want to turn /home/bob into a subvol, it would go something like this:

# btrfs sub snap /home /home/bob-temp
# rm -rf /home/bob-temp/hugo /home/bob-temp/fred /home/bob-temp/wilma
# mv /home/bob-temp/bob/* /home/bob-temp/
# rmdir /home/bob-temp/bob
# mv /home/bob /home/bob-old
# mv /home/bob-temp /home/bob

   Both your approach and the one above involve deleting large
quantities of things, so be careful you don't delete too much. :)

 Can the subvolume have the same name as the subdirectory it is
 replacing, or should it be called something like 'tempDocs', and then
 renamed back to 'Documents' after the original has gone?

   It'll have to have a different name temporarily. Subvolumes live in
the same namespace as the rest of the filesystem objects (like files
and directories).

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- I believe that it's closely correlated with ---   
   the aeroswine coefficient.


signature.asc
Description: Digital signature


Re: [PATCH] btrfs-progs: read global reserve size from space infos

2014-04-22 Thread Hugo Mills
On Tue, Apr 22, 2014 at 03:20:00PM +0200, David Sterba wrote:
 Kernels 3.15  export the global block reserve as a space info presented
 by 'btrfs fi df' but would display 'unknown' instead of some meaningful
 string.
 
 Signed-off-by: David Sterba dste...@suse.cz
 ---
 
 Global_rsv or GlobalRsv or Globalrsv or something else?

   Personally, I'd probably go for the camel case GlobalRsv, or
possibly GlbReserve. (Assuming that it's going to be only a single
token without whitespace to make parsing easier).

   Hugo.

  cmds-filesystem.c | 2 ++
  ctree.h   | 2 ++
  2 files changed, 4 insertions(+)
 
 diff --git a/cmds-filesystem.c b/cmds-filesystem.c
 index 306f715475ac..5a3bbca91458 100644
 --- a/cmds-filesystem.c
 +++ b/cmds-filesystem.c
 @@ -129,6 +129,8 @@ static char *group_type_str(u64 flag)
  		return "Metadata";
  	case BTRFS_BLOCK_GROUP_DATA|BTRFS_BLOCK_GROUP_METADATA:
  		return "Data+Metadata";
 +	case BTRFS_SPACE_INFO_GLOBAL_RSV:
 +		return "Global_rsv";
  	default:
  		return "unknown";
   }
 diff --git a/ctree.h b/ctree.h
 index a4d2cd114614..7e8ced718931 100644
 --- a/ctree.h
 +++ b/ctree.h
 @@ -861,6 +861,8 @@ struct btrfs_csum_item {
  /* used in struct btrfs_balance_args fields */
  #define BTRFS_AVAIL_ALLOC_BIT_SINGLE	(1ULL << 48)
  
 +#define BTRFS_SPACE_INFO_GLOBAL_RSV	(1ULL << 49)
 +
  #define BTRFS_QGROUP_STATUS_OFF  0
  #define BTRFS_QGROUP_STATUS_ON   1
  #define BTRFS_QGROUP_STATUS_SCANNING 2

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Some days,  it's just not worth gnawing through the straps. ---   


signature.asc
Description: Digital signature


Re: Slow Write Performance w/ No Cache Enabled and Different Size Drives

2014-04-22 Thread Hugo Mills
On Tue, Apr 22, 2014 at 11:42:09AM -0600, Chris Murphy wrote:
 
 On Apr 21, 2014, at 3:09 PM, Duncan 1i5t5.dun...@cox.net wrote:
 
  Adam Brenner posted on Sun, 20 Apr 2014 21:56:10 -0700 as excerpted:
  
  So ... BTRFS at this point in time, does not actually stripe the data
  across N number of devices/blocks for aggregated performance increase
  (both read and write)?
  
  What Chris says is correct, but just in case it's unclear as written, let 
  me try a reworded version, perhaps addressing a few uncaught details in 
  the process.
 
 Another likely problem is terminology. It's 2014 and still we don't have 
 consistency in basic RAID terminology. We're functionally in the 19th-century 
 state of uncoordinated disagreement over weights and measures, except maybe worse 
 because we sometimes have multiple words that mean the same thing -- as if 
 there were multiple words for the term gram or meter. It's just nonsensical 
 and selfish that this continues to persist across various file system 
 projects.
 
 It's not immediately obvious to the btrfs newcomer that the md raid chunk 
 isn't the same thing as the btrfs chunk, for example.
 
 And strip, chunk, stripe unit, and stripe size get used interchangeably to 
 mean the same thing, while just as often stripe size means something 
 different. The best definition I've found so far is IBM's stripe unit 
 definition: granularity at which data is stored on one drive of the array 
 before subsequent data is stored on the next drive of the array which is in 
 bytes. So that's the smallest raid unit we find on a drive, therefore it is a 
 base unit in RAID, and yet we have no agreement on what word to use.
 
 And it's not really like the storage industry trade association, SNIA, who 
 published a dictionary of terms in 2013, really helps in this area. I'll 
 argue they make it worse because they deprecate the term chunk, in favor of 
 the terms strip and stripe element. NO kidding, two terms mean the same 
 thing. Yet strip and stripe are NOT the same thing.
 
 strip = stripe element
 stripe = set of strips
 strip size = stripe depth
 stripe size = strip size * extents not including parity extents
 
 Also the units are in blocks (sectors, not fs blocks and not bytes). The 
 terms stripe unit, stripe width, and stride aren't found in the SNIA 
 dictionary at all although they are found as terms in other file system 
 projects.
 
 So no matter how we look at it, everyone else is doing it wrong.

   Also not helped by btrfs's co-option of the term RAID-1 to mean
something that's not traditional RAID-1, and (internally) stripe and
chunk to mean things that don't match (I think) any of the
definitions above...

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- A clear conscience.  Where did you get this taste ---
 for luxuries,  Bernard? 


signature.asc
Description: Digital signature


Re: Can anyone boot a system using btrfs root with linux 3.14 or newer?

2014-04-23 Thread Hugo Mills
On Wed, Apr 23, 2014 at 11:54:13AM -0700, Marc MERLIN wrote:
 On Wed, Apr 23, 2014 at 08:30:08PM +0300, Пламен Петров wrote:
  Can anyone boot a system using btrfs root with linux 3.14 or newer?
  
  Because I can't.
 
 It works fine for me.
 
  I'm trying to move some 3.13.x based systems to 3.14.x and the kernel panics
  during boot. It says to append a correct root=sdaX partition, but the one
  provided is correct, because if use 3.13.x with the same kernel command line
  - the system boots fine.
  
 My guess is that you have btrfs compiled as a module, it then needs to be in
 an initrd, and you either haven't built it and put it in the right place, or
 grub isn't setup to load that initrd.
 
  #menuentry 0
  title Linux
  root (hd0,0)
  kernel /vmlinuz rw root=/dev/sda2 vga=6 raid=noautodetect
 
 That's missing an initrd. Are you absolutely certain then that btrfs is
 compiled in the kernel and not as a module?

   And the other thing to check here is that if this is a multi-device
filesystem, you need to have your initrd run btrfs dev scan before
trying to mount.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- The glass is neither half-full nor half-empty; it is twice as ---  
large as it needs to be. 


signature.asc
Description: Digital signature


Re: Can anyone boot a system using btrfs root with linux 3.14 or newer?

2014-04-23 Thread Hugo Mills
On Wed, Apr 23, 2014 at 03:03:12PM -0700, Marc MERLIN wrote:
 On Thu, Apr 24, 2014 at 12:54:57AM +0300, Пламен Петров wrote:
   It may help to look up what error -38 translates into for that mount 
   error.
  
  My searches so far failed to return anything useful to solving this problem.
  
 Yeah, I searched before you :) this would require reading the kernel source
 to track down -38.
 (not hard, I just didn't do it)

#define ENOSYS  38  /* Function not implemented */

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Computer Science is not about computers,  any more than --- 
 astronomy is about telescopes.  


signature.asc
Description: Digital signature


Re: Can anyone boot a system using btrfs root with linux 3.14 or newer?

2014-04-23 Thread Hugo Mills
On Wed, Apr 23, 2014 at 04:40:33PM -0600, Chris Murphy wrote:
 The screen shot provided makes it clear that one of the following kernel 
 parameters is incorrect:
 
  root=/dev/mapper/cryptroot
 
  rootflags=subvol=root
 
 So either the dmcrypt volume hasn't been opened, thus isn't available; or 
 rootfs isn't on a subvolume named root found at the top level of the file 
 system. So I'd say this isn't a btrfs problem, rather it's due to some 
 earlier misconfiguration that's preventing rootfs from being mounted.

   You'll need an initrd to run cryptsetup (or whatever) to collect a
passphrase and decrypt the volume before you try to mount it. This
sounds like a case for an initrd again.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Computer Science is not about computers,  any more than --- 
 astronomy is about telescopes.  


signature.asc
Description: Digital signature


Re: raid6, disks of different sizes, ENOSPC errors despite having plenty of space

2014-04-23 Thread Hugo Mills
On Wed, Apr 23, 2014 at 05:04:10PM -0400, Sergey Ivanyuk wrote:
 Hi,
 
 I have a filesystem that I've converted to raid6 from raid1, on 4 drives (I
 have another copy of the data):
 
 Total devices 4 FS bytes used 924.64GiB
 devid1 size 1.82TiB used 474.00GiB path /dev/sdd
 devid2 size 465.76GiB used 465.76GiB path /dev/sda
 devid3 size 465.76GiB used 465.76GiB path /dev/sdb
 devid4 size 465.76GiB used 465.73GiB path /dev/sdc
 
 Data, RAID6: total=924.00GiB, used=923.42GiB
 System, RAID1: total=32.00MiB, used=208.00KiB
 Metadata, RAID1: total=1.70GiB, used=1.28GiB
 Metadata, DUP: total=384.00MiB, used=252.13MiB
 unknown, single: total=512.00MiB, used=0.00
 
 
 Recent btrfs-progs built from source, kernel 3.15.0-rc2 on armv7l. Despite
 having plenty of space left on the larger drive, attempting to copy more
 data onto the filesystem results in a kworker process pegged at 100% CPU
 for a very long time (10s of minutes), at which point the writes proceed
 for some time, and the process repeats until the eventual No space left on
 device error. Balancing fails with the same error, even if attempting to
 convert back to raid1.
 
 I realize that this likely has something to do with the disparity between
 device sizes, and per the wiki a fixed-width stripe may help, though I'm
 not sure if it's possible to change the stripe width in my situation, since
 I can't rebalance. Is there anything I can do to get this filesystem back
 to writable state?

   With those device sizes, yes, you're going to have limits on the
available data you can store -- with RAID-6, it'll be 465.76*(4-2) =
931.52 GB (less metadata space), so your conclusion above is indeed
correct.

   We don't have the fixed-width stripe feature implemented yet, which
probably explains why you can't use it. :) You can play with an
approximation of the consequences, once the feature is there, at
http://carfax.org.uk/btrfs-usage/ . Without that feature, though,
there's not much you can do to improve the situation. What might help
in converting back to RAID-1 is adding a small device to the FS
temporarily before doing the conversion, and then removing it again
afterwards.
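
   As a rough sketch of that last suggestion (device and mount point names
here are made up, and do check that your other copy of the data is current
before converting):

# btrfs device add /dev/sdX /mnt
# btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
# btrfs device delete /dev/sdX /mnt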

 Also, here's a stack trace for the stuck kworker process, which appears to
 be a bug since it does this for a very long time:

   This is probably something different.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Computer Science is not about computers,  any more than --- 
 astronomy is about telescopes.  


signature.asc
Description: Digital signature


Re: Can anyone boot a system using btrfs root with linux 3.14 or newer?

2014-04-23 Thread Hugo Mills
On Wed, Apr 23, 2014 at 03:50:18PM -0700, Marc MERLIN wrote:
 On Wed, Apr 23, 2014 at 11:43:03PM +0100, Hugo Mills wrote:
  On Wed, Apr 23, 2014 at 04:40:33PM -0600, Chris Murphy wrote:
   The screen shot provided makes it clear that one of the following kernel 
   parameters is incorrect:
   
root=/dev/mapper/cryptroot
   
rootflags=subvol=root
   
   So either the dmcrypt volume hasn't been opened, thus isn't available; or 
   rootfs isn't on a subvolume named root found at the top level of the file 
   system. So I'd say this isn't a btrfs problem, rather it's due to some 
   earlier misconfiguration that's preventing rootfs from being mounted.
  
 You'll need an initrd to run cryptsetup (or whatever) to collect a
  passphrase and decrypt the volume before you try to mount it. This
  sounds like a case for an initrd again.
 
 I think you are confused, that's my config I pasted as an example.
 He does not use dmcrypt in his example.

   Ah, OK. Never mind, then.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Computer Science is not about computers,  any more than --- 
 astronomy is about telescopes.  


signature.asc
Description: Digital signature


Re: btrfs send receive, clone

2014-04-24 Thread Hugo Mills
On Thu, Apr 24, 2014 at 09:23:28AM -0600, Chris Murphy wrote:
 
 
 I don't understand the btrfs send -c clone-src man page text, or really 
 even the use case. In part this is what it says:
 
  You must not specify clone sources unless you
   guarantee that these snapshots are exactly in the same state on both
   sides, the sender and the receiver.
 
 If the snapshots are the same on both sides, then why would I be using clone 
 in the first place?

   To copy over another snapshot which shares data with them.

  -c clone-src Use this snapshot as a clone source for an 
  incremental send (multiple allowed)
 
 Incremental send implies the sender and receiver are not in the same state 
 now, but will be after the command is executed. Is one, or both, snapshots rw 
 for -c?
 
 Anyway, I'm lost on the specifics, but clearly I'm even lost when it comes to 
 the basic difference between -p and -c.

(Note: I've not actually tried the second case in what follows, but
it's what I think is going on. This may be subject to corrections.)

   OK, call the sending system S and the receiving system R. Let's
say we've got three subvolumes on S:

S:A2, the current /home (say)
S:A1, a snapshot of an earlier version of S:A2
S:B, a separate subvolume that's had some CoW copies of files in both
 S:A1 and S:A2 made into it.

   If we send S:A1 to R, then we'll have to send the whole thing,
because R doesn't have any subvolumes yet.

   If we now want to send S:A2 to R, then we can use -p S:A1, and it
will send just the differences between those two. This means that the
send stream can potentially ignore a load of the metadata as well as
the data. It's effectively saying, you can clone R:A1, then do these
things to it to get R:A2.

   If we now want to send S:B to R, then we can use -c S:A1 -c S:A2.
Note that S:B doesn't have any metadata in common with either of the
As, only data. This will send all of the metadata (start with an
empty subvolume and do these things to it to get R:B), but because
it's known to share data with some subvols on S, and those subvols
also exist on R, we can avoid sending that data again by simply
specifying where the data can be found and reflinked from on R.
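
   In command form, that's roughly the following (paths and the ssh
transport are illustrative only, and each snapshot has to be read-only
before it can be sent):

# btrfs send /mnt/A1 | ssh R 'btrfs receive /backup'
# btrfs send -p /mnt/A1 /mnt/A2 | ssh R 'btrfs receive /backup'
# btrfs send -c /mnt/A1 -c /mnt/A2 /mnt/B | ssh R 'btrfs receive /backup'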

   So, if you have a load of snapshots, you can do one of two things
to duplicate all of them:

btrfs sub send snap 0
for n=1 to N
   btrfs sub send -p snap n-1 snap n

   Or, in any order,

btrfs sub send snap s1
for n=1 to N
   btrfs sub send -c snap s1 -c snap s2 -c snap s3 ... snap sn

where each subvolume that's been sent before gets added as a -c to the
next send command. This second approach means that all possible
reflinks between subvolumes can be captured, but it will send all of
the metadata across each time. The first approach may lose some manual
reflink efficiency, but is better at sending only the necessary
changed metadata. You should be able to combine the two methods, I
think.
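
   As a concrete (but untested) bash rendering of those two methods,
assuming read-only snapshots named /mnt/snap.0 to /mnt/snap.$N (names made
up) and another btrfs mounted at /backup to receive into:

# Method 1: chain of -p sends, each relative to the previous snapshot
btrfs send /mnt/snap.0 | btrfs receive /backup
for n in $(seq 1 "$N"); do
    btrfs send -p "/mnt/snap.$((n-1))" "/mnt/snap.$n" | btrfs receive /backup
done

# Method 2: pass every snapshot already sent as a clone source
clones=()
btrfs send /mnt/snap.0 | btrfs receive /backup
for n in $(seq 1 "$N"); do
    clones+=(-c "/mnt/snap.$((n-1))")
    btrfs send "${clones[@]}" "/mnt/snap.$n" | btrfs receive /backup
done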

   I'm trying to think of a case where -c is useful that doesn't
involve someone having done cp --reflink=always between subvolumes,
but I can't. So, I think the summary is:

 * Use -p to deal with parent-child reflinks through snapshots
 * Use -c to specify other subvolumes (present on both sides) that
   might contain reflinked data

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Well, you don't get to be a kernel hacker simply by looking ---   
good in Speedos. -- Rusty Russell


signature.asc
Description: Digital signature


Re: btrfs send receive, clone

2014-04-24 Thread Hugo Mills
On Thu, Apr 24, 2014 at 04:55:10PM +0100, Hugo Mills wrote:
I'm trying to think of a case where -c is useful that doesn't
 involve someone having done cp --reflink=always between subvolumes,
 but I can't.

   OK, you can use -c if you don't have a record of the relationships
between the subvolumes you want to send, but know that they're related
in some way. As above, you send the first subvol bare, and then
supply a -c for each one that you've already sent.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- echo killall cat  ~/curiosity.sh ---   


signature.asc
Description: Digital signature


Re: btrfs send receive, clone

2014-04-24 Thread Hugo Mills
On Thu, Apr 24, 2014 at 11:22:40AM -0600, Chris Murphy wrote:
 
 On Apr 24, 2014, at 9:55 AM, Hugo Mills h...@carfax.org.uk wrote:
 
  On Thu, Apr 24, 2014 at 09:23:28AM -0600, Chris Murphy wrote:
  
  
  I don't understand the btrfs send -c clone-src man page text, or really 
  even the use case. In part this is what it says:
  
  You must not specify clone sources unless you
  guarantee that these snapshots are exactly in the same state on both
  sides, the sender and the receiver.
  
  If the snapshots are the same on both sides, then why would I be using 
  clone in the first place?
  
To copy over another snapshot which shares data with them.
  
  -c clone-src Use this snapshot as a clone source for an 
  incremental send (multiple allowed)
  
  Incremental send implies the sender and receiver are not in the same state 
  now, but will be after the command is executed. Is one, or both, snapshots 
  rw for -c?
  
  Anyway, I'm lost on the specifics, but clearly I'm even lost when it comes 
  to the basic difference between -p and -c.
  
  (Note: I've not actually tried the second case in what follows, but
  it's what I think is going on. This may be subject to corrections.)
  
OK, call the sending system S and the receiving system R. Let's
  say we've got three subvolumes on S:
  
  S:A2, the current /home (say)
  S:A1, a snapshot of an earlier version of S:A2
  S:B, a separate subvolume that's had some CoW copies of files in both
  S:A1 and S:A2 made into it.
  
If we send S:A1 to R, then we'll have to send the whole thing,
  because R doesn't have any subvolumes yet.
  
If we now want to send S:A2 to R, then we can use -p S:A1, and it
  will send just the differences between those two. This means that the
  send stream can potentially ignore a load of the metadata as well as
  the data. It's effectively saying, you can clone R:A1, then do these
  things to it to get R:A2.
  
If we now want to send S:B to R, then we can use -c S:A1 -c S:A2.
 
 OK this makes sense now, thanks.
 
 Does the use of -c always require at least two -c instances? Is there an 
 example where -c is used once? From the man page I'm not groking that there 
 must be at least two -c's.

   No, my understanding is that you could have any number (0 or more).
It just allows the sending side to tell the receiving side that
there's some shared data in use that it's already got the data for,
and it just needs to hook up the extents. The reason I used two -cs
above was because there's data that S:B shares with those two
subvolumes (because that's the example scenario I picked). If S:B only
shared with one subvolume, you would use only one -c.

I'm trying to think of a case where -c is useful that doesn't
  involve someone having done cp --reflink=always between subvolumes,
  but I can't.
 
 OK.
 
 
  So, I think the summary is:
  
  * Use -p to deal with parent-child reflinks through snapshots
  * Use -c to specify other subvolumes (present on both sides) that
might contain reflinked data
 
 I think the key is that -c implies a minimum of five subvolumes: two 
 subvolumes on the source, which have (identical) counterparts on the 
 destination (that's four subvolumes), and then one additional somehow related 
 subvolume B on the source that I want on the destination.

   No, -c implies three subvolumes that exist: the one provided to the
-c, which must exist on both sides as a data source, and the one being
sent, which exists on the sending side, and will be recreated on the
receiving side, with any shared extents replicated.

 Whereas -p implies three subvolumes (one on the source which is the parent, 
 its counterpart on the destination, and a child on the source which I want on 
 the destination). I necessarily must understand the relationship among them 
 in order to get the desired incremental result on the destination.

   I don't think you have to know that the subvol being sent is the
child of the subvol provided with -p. I suspect that the operation
would work just as well round the other way (i.e., if you've already
sent the latest snapshot, you could do a cheaper copy of older
snapshots by sending them with -p latest_subvol). Remember, there's
not really any deep FS-level concept of parent/child with snapshots.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Sometimes, when I'm alone, I Google myself. ---   


signature.asc
Description: Digital signature


Re: safe/necessary to balance system chunks?

2014-04-25 Thread Hugo Mills
On Fri, Apr 25, 2014 at 02:12:17PM -0400, Austin S Hemmelgarn wrote:
 On 2014-04-25 13:24, Chris Murphy wrote:
  
  On Apr 25, 2014, at 8:57 AM, Steve Leung sjle...@shaw.ca wrote:
  
 
  Hi list,
 
  I've got a 3-device RAID1 btrfs filesystem that started out life as 
  single-device.
 
  btrfs fi df:
 
  Data, RAID1: total=1.31TiB, used=1.07TiB
  System, RAID1: total=32.00MiB, used=224.00KiB
  System, DUP: total=32.00MiB, used=32.00KiB
  System, single: total=4.00MiB, used=0.00
  Metadata, RAID1: total=66.00GiB, used=2.97GiB
 
  This still lists some system chunks as DUP, and not as RAID1.  Does this 
  mean that if one device were to fail, some system chunks would be 
  unrecoverable?  How bad would that be?
  
  Since it's system type, it might mean the whole volume is toast if the 
  drive containing those 32KB dies. I'm not sure what kind of information is 
  in system chunk type, but I'd expect it's important enough that if 
  unavailable that mounting the file system may be difficult or impossible. 
  Perhaps btrfs restore would still work?
  
  Anyway, it's probably a high penalty for losing only 32KB of data.  I think 
  this could use some testing to try and reproduce conversions where some 
  amount of system or metadata type chunks are stuck in DUP. This has 
  come up before on the list but I'm not sure how it's happening, as I've 
  never encountered it.
 
 As far as I understand it, the system chunks are THE root chunk tree for
 the entire system, that is to say, it's the tree of tree roots that is
 pointed to by the superblock. (I would love to know if this
 understanding is wrong).  Thus losing that data almost always means
 losing the whole filesystem.

   From a conversation I had with cmason a while ago, the System
chunks contain the chunk tree. They're special because *everything* in
the filesystem -- including the locations of all the trees, including
the chunk tree and the roots tree -- is positioned in terms of the
internal virtual address space. Therefore, when starting up the FS,
you can read the superblock (which is at a known position on each
device), which tells you the virtual address of the other trees... and
you still need to find out where that really is.

   The superblock has (I think) a list of physical block addresses at
the end of it (sys_chunk_array), which allows you to find the blocks
for the chunk tree and work out this mapping, which allows you to find
everything else. I'm not 100% certain of the actual format of that
array -- it's declared as u8 [2048], so I'm guessing there's a load of
casting to something useful going on in the code somewhere.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Is it still called an affair if I'm sleeping with my wife ---
behind her lover's back?


signature.asc
Description: Digital signature


Re: Confusing output of btrfs fi df

2014-04-26 Thread Hugo Mills
On Sat, Apr 26, 2014 at 04:09:15PM +0200, Stefan Malte Schumacher wrote:
 Hello
 
 Yesterday I created a btrfs-filesystem on two disk, using raid1 for
 data and metadata. I then mounted it and rsynced several TB of data
 onto it.
 
 mkfs.btrfs -m raid1 -d raid1 /dev/sdf /dev/sdg
 
 The command btrfs fi df /mnt/btrfs result in the following output:
 
 Data, RAID1: total=2.64TiB, used=2.22TiB
 Data, single: total=8.00MiB, used=0.00
 System, RAID1: total=8.00MiB, used=380.00KiB
 System, single: total=4.00MiB, used=0.00
 Metadata, RAID1: total=4.00GiB, used=2.94GiB
 Metadata, single: total=8.00MiB, used=0.00
 
 I am a bit confused because of the single-entries. They are not
 shown in the UseCases-example on the btrfs-website and I wonder if I
 did something wrong.

   They're harmless -- it's a side-effect of the way that mkfs works.
They'll go away if you balance them:

   btrfs balance start -dprofiles=single -mprofiles=single -sprofiles=single 
/mountpoint

 I also would like to know if its possible to label a multi-disk
 filesystem after creation. 

   btrfs fi label should do this.
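
   For example (the label and mount point are made up):

# btrfs filesystem label /mnt/btrfs media-pool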

 For your information, I am using Btrfs v3.12+20131125 and kernel
 3.11.10-7 64bit. My distribution is an openSUSE 13.1.

   You might want to look at upgrading to 3.13 or 3.14 kernel, which
has 6 months or so extra bug fixes in it.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Keming (n.) The result of poor kerning ---  


signature.asc
Description: Digital signature


Re: kernel BUG at /home/apw/COD/linux/fs/btrfs/extent_io.c:2116! when deleting device or balancing filesystem.

2014-04-28 Thread Hugo Mills
On Mon, Apr 28, 2014 at 03:26:45AM +, Duncan wrote:
 Jaap Pieroen posted on Sun, 27 Apr 2014 18:30:19 +0200 as excerpted:
 
  Hello,
  
  When I try to delete a device from my btrfs filesystem I always get the
  following kernel bug error:
 
  kernel BUG at /home/apw/COD/linux/fs/btrfs/extent_io.c:2116!
  invalid opcode:  [#3] SMP
  See attached log file for more details.
 
 That's a reasonably common, generic error, simply indicating the kernel 
 got an invalid/zero opcode instead of what it was supposed to get, but 
 not really saying why, tho the log does give some more info.

   More than that -- the invalid opcode is simply the way that the
BUG() and BUG_ON() macros are implemented.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- People are too unreliable to be replaced by machines. ---  


signature.asc
Description: Digital signature


Re: Confusing output of btrfs fi df

2014-04-28 Thread Hugo Mills
On Mon, Apr 28, 2014 at 01:57:02PM +0200, Stefan Malte Schumacher wrote:
 
 
  So try this one:
  btrfs balance start -musage=0 -v
 
 I fear that didn't work either. 
 
 mars:/mnt # btrfs balance start -musage=0 -v btrfs/
 Dumping filters: flags 0x6, state 0x0, force is off
   METADATA (flags 0x2): balancing, usage=0
   SYSTEM (flags 0x2): balancing, usage=0
   Done, had to relocate 1 out of 2708 chunks
   
 mars:/mnt # btrfs fi df btrfs/
 Data, RAID1: total=2.64TiB, used=2.22TiB
 System, RAID1: total=8.00MiB, used=380.00KiB
 System, single: total=4.00MiB, used=0.00
 Metadata, RAID1: total=4.00GiB, used=2.94GiB
 
 
 If that fails to remove the extra system chunk, then we have a mystery
 indeed.  What's different on your system and why isn't it working?
 
 I have no idea. It's just a plain openSUSE 13.1 and they consider btrfs
 support stable enough to use it as default filesystem in the upcoming
 13.2. I could create the filesystem again and restore the data but of
 course I would actually need to know what went wrong the first time in
 order to avoid doing it again. Is there anything you need to know
 about my system which would be of use? (Controller, Disks, Mainboard
 etc. ?)  

   The question is, why is this important?

   The presence of that area won't affect the operation of the FS in
the slightest. The FS won't write any data to that area, and it's only
4MiB in size -- completely lost in the noise for a 2.6 TiB filesystem.
At worst, it's an extra line of output; slightly messy, but utterly
harmless.

   I think the default kernel for OpenSuSE 13.1 is 3.11, which may be
old enough that it doesn't have the patch that allows balancing of
chunk 0 (which is probably what's happening here).

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- You are demons,  and I am in Hell! Well, technically, it's ---  
   London,  but it's an easy mistake to make.   


signature.asc
Description: Digital signature


Re: EBS volumes with identical UUIDs + btrfs

2014-04-29 Thread Hugo Mills
On Tue, Apr 29, 2014 at 02:44:08PM -0700, Brandon Philips wrote:
 Hello All-
 
 I attached an AWS EBS volume to `xvdh` that was from a terminated EC2
 machine to another machine. The filesystem shared a btrfs UUID since
 they came from an identical install. When I mounted the new EBS volume
 to /mnt something very odd happened:
 
 Before:
 
 $ mount
 /dev/xvda9 on / type btrfs (rw,relatime,ssd,space_cache)
 /dev/xvda3 on /usr type ext4 (ro,relatime)
 
 After:
 
 # mount /dev/xvdh9 /mnt
 
 # mount
 /dev/xvdh9 on / type btrfs (rw,relatime,ssd,space_cache)
 /dev/xvdh9 on /mnt type btrfs (rw,relatime,ssd,space_cache)
 
 It seems that btrfs gets very confused when there are matching UUIDs
 and /mnt didn't contain the contents that I expected. To work around
 the issue I booted a non-identical machine image that had a different
 btrfs UUID and attached the backup EBS volume again and everything
 worked as expected.
 
 What is the right way of handling this?

   The only solution that there is right now is, "don't do that".
btrfs basically assumes that if several block devices have the same
UUID in their btrfs superblocks, they're different parts of the same
filesystem. If they're actually clones of the same filesystem, then it
has problems, and can _really_ screw things up, as you've discovered.

   The closest thing to a good solution that's been proposed so far
is to have a tool that will scan the metadata on a block device (or a
set of block devices making up a filesystem) and rewrite the FS UUID
embedded in every metadata block. This is likely to be expensive.

   To do the conversion, you'll have to either (a) load the chunk tree
and only scan the metadata chunks, or (b) scan the whole FS for things
that look like metadata blocks and convert every block you find. In
either case, you'll have to supply exact names for the block device(s)
to convert -- preferably as a whole (particularly in case (a), where
you need all that info to find the current chunk tree).

   Option (a) is useful if you already have the clones -- but given
the behaviour of most udev installations these days, that's already
got you in a dangerous position, because udev has probably already
detected the new devices and run btrfs dev scan on them. Option (b) is
handy if you want to treat the image as a stream (e.g. dd if=/dev/sda
| btrfs fi set-uuid --stream | dd of=/dev/sdb)

   Needless to say, neither of these has actually been implemented
yet.

   Hugo.

 Attaching EBS volumes from
 snapshots or old identical machines is a common use case.
 
 Thanks!
 
 Brandon

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- If you're not part of the solution, you're part --- 
   of the precipiate.


signature.asc
Description: Digital signature


Re: Help with space

2014-05-02 Thread Hugo Mills
On Fri, May 02, 2014 at 01:21:50PM -0600, Chris Murphy wrote:
 
 On May 2, 2014, at 2:23 AM, Duncan 1i5t5.dun...@cox.net wrote:
  
  Something tells me btrfs replace (not device replace, simply replace) 
  should be moved to btrfs device replace…
 
 The syntax for btrfs device is different though; replace is like balance: 
 btrfs balance start and btrfs replace start. And you can also get a status on 
 it. We don't (yet) have options to stop, start, resume, which could maybe 
 come in handy for long rebuilds and a reboot is required (?) although maybe 
 that just gets handled automatically: set it to pause, then unmount, then 
 reboot, then mount and resume.
 
  Well, I'd say two copies if it's only two devices in the raid1... would 
  be true raid1.  But if it's say four devices in the raid1, as is 
  certainly possible with btrfs raid1, that if it's not mirrored 4-way 
  across all devices, it's not true raid1, but rather some sort of hybrid 
  raid,  raid10 (or raid01) if the devices are so arranged, raid1+linear if 
  arranged that way, or some form that doesn't nicely fall into a well 
  defined raid level categorization.
 
 Well, md raid1 is always n-way. So if you use -n 3 and specify three devices, 
 you'll get 3-way mirroring (3 mirrors). But I don't know any hardware raid 
 that works this way. They all seem to treat raid 1 as strictly two devices. At 4 
 devices it's raid10, and only in pairs.
 
 Btrfs raid1 with 3+ devices is unique as far as I can tell. It is something 
 like raid1 (2 copies) + linear/concat. But that allocation is round robin. I 
 don't read code but based on how a 3 disk raid1 volume grows VDI files as 
 it's filled it looks like 1GB chunks are copied like this
 
 Disk1   Disk2   Disk3
 134     124     235
 679     578     689
 
 So 1 through 9 each represent a 1GB chunk. Disk 1 and 2 each have a chunk 1; 
 disk 2 and 3 each have a chunk 2, and so on. Total of 9GB of data taking up 
 18GB of space, 6GB on each drive. You can't do this with any other raid1 as 
 far as I know. You do definitely run out of space on one disk first though 
 because of uneven metadata to data chunk allocation.

   The algorithm is that when the chunk allocator is asked for a block
group (in pairs of chunks for RAID-1), it picks the number of chunks
it needs, from different devices, in order of the device with the most
free space. So, with disks of size 8, 4, 4, you get:

Disk 1: 12345678
Disk 2: 1357
Disk 3: 2468

and with 8, 8, 4, you get:

Disk 1: 1234568A
Disk 2: 1234579A
Disk 3: 6789
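
   If you want to play with the pattern, here's a rough (and entirely
unofficial) bash simulation of that most-free-space rule for two copies --
tie-breaking between equally-empty devices is arbitrary, so the exact
interleaving may differ from the tables above:

#!/bin/bash
# free space per device, in 1GiB chunks -- edit to taste
free=(8 4 4)
declare -a alloc
chunk=1
while :; do
    # sort device indices by remaining free space, largest first
    mapfile -t order < <(for i in "${!free[@]}"; do
                             echo "${free[$i]} $i"
                         done | sort -rn | awk '{print $2}')
    a=${order[0]}; b=${order[1]}
    # stop when we can no longer place two copies on two different devices
    [ "${free[$b]}" -eq 0 ] && break
    free[$a]=$((free[$a] - 1)); free[$b]=$((free[$b] - 1))
    alloc[$a]+="$chunk "; alloc[$b]+="$chunk "
    chunk=$((chunk + 1))
done
for i in "${!alloc[@]}"; do echo "Disk $((i + 1)): ${alloc[$i]}"; done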

   Hugo.

 Anyway I think we're off the rails with raid1 nomenclature as soon as we have 
 3 devices. It's probably better to call it replication, with an assumed 
 default of 2 replicates unless otherwise specified.
 
 There's definitely a benefit to a 3 device volume with 2 replicates, 
 efficiency wise. As soon as we go to four disks 2 replicates it makes more 
 sense to do raid10, although I haven't tested odd device raid10 setups so I'm 
 not sure what happens.
 
 
 Chris Murphy
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Prisoner unknown:  Return to Zenda. ---   


signature.asc
Description: Digital signature


Re: copies= option

2014-05-04 Thread Hugo Mills
 implementation of
 N-way-mirroring as soon as possible after raid56 completion, because I
 really /really/ want N-way-mirroring, and this other thing would
 certainly be extremely nice, but I'm quite fearful that it could also be
 the perfect being the enemy of the good-enough, and btrfs already has a
 long history of features repeatedly taking far longer to implement than
 originally predicted, which with something that potentially complex,
 I'm very afraid could mean a 2-5 year wait before it's actually usable.
 
 And given how long I've been waiting for the simple-compared-to-that
 N-way-mirroring thing and how much I anticipate it, I just don't know
 what I'd do if I were to find out that they were going to work on this
 perfect thing instead, with N-way-mirroring being one possible option
 with it, but that as a result, given the btrfs history to date, it'd
 very likely be a good five years before I could get the comparatively
 simple N-way-mirroring (or even, for me, just a specific
 3-way-mirroring to compliment the specific 2-way-mirroring that's
 already there) that's all I'm really asking for.
 
 So I guess you can see why I don't want to get into the details of the
 more fancy solution too much, both as a means of protecting my own
 sanity, and to hopefully avoid throwing the 3-way-mirroring that's my
 own personal focal point off the track.  So Hugo's the one with the
 details, to the extent they've been discussed at least, there.
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- emacs:  Eighty Megabytes And Constantly Swapping. ---


signature.asc
Description: Digital signature


Re: How does Suse do live filesystem revert with btrfs?

2014-05-04 Thread Hugo Mills
On Sun, May 04, 2014 at 04:26:45PM -0700, Marc MERLIN wrote:
 Actually, never mind Suse, does someone know whether you can revert to
 an older snapshot in place?

   Not while the system's running useful services, no.

 The only way I can think of is to mount the snapshot on top of the other
 filesystem. This gets around the umounting a filesystem with open
 filehandles problem, but this also means that you have to keep track of
 daemons that are still accessing filehandles on the overlayed
 filesystem.

   You have a good handle on the problems.

 My one concern with this approach is that you can't free up the
 subvolume/snapshot of the underlying filesystem if it's mounted and even
 after you free up filehandles pointing to it, I don't think you can
 umount it.
 
 In other words, you can play this trick to delay a reboot a bit, but
 ultimately you'll have to reboot to free up the mountpoints, old
 subvolumes, and be able to delete them.

   Yup.

 Somehow I'm thinking Suse came up with a better method.

   I'm guessing it involves reflink copies of files from the snapshot
back to the original, and then restarting affected services. That's
about the only other thing that I can think of, but it's got a load of
race conditions in it (albeit difficult to hit in most cases, I
suspect).

   Hugo.

 Even if you don't know Suse, can you think of a better way to do this?
 
 Thanks,
 Marc
 
 On Sat, May 03, 2014 at 05:52:57PM -0700, Marc MERLIN wrote:
  (more questions I'm asking myself while writing my talk slides)
  
  I know Suse uses btrfs to roll back filesystem changes.
  
  So I understand how you can take a snapshot before making a change, but
  not how you revert to that snapshot without rebooting or using rsync,
  
  How do you do a pivot-root like mountpoint swap to an older snapshot,
  especially if you have filehandles opened on the current snapshot?
  
  Is that what Suse manages, or are they doing something simpler?
  
  Thanks,
  Marc
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- That's not rain,  that's a lake with slots in it. ---


signature.asc
Description: Digital signature


Re: Unable to boot

2014-05-05 Thread Hugo Mills
 kernels first, then 
 recovery mount options first. Sometimes the repair option makes things worse. 
 I'm not sure what its safety status is as of v3.14.
 
 https://btrfs.wiki.kernel.org/index.php/Problem_FAQ
 
 Fedora includes btrfs-zero-log already so depending on the kernel messages 
 you might try that before a btrfsck --repair.
 
 
 
 Chris Murphy
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- It's against my programming to impersonate a deity! ---   


signature.asc
Description: Digital signature


[PATCH 1/3] btrfs check: Fix wrong level access

2014-05-05 Thread Hugo Mills
There's no reason to assume that the bad key order is in a leaf block,
so accessing level 0 of the path is going to be an error if it's actually
a node block that's bad.

Reported-by: Chris Mason c...@fb.com
Signed-off-by: Hugo Mills h...@carfax.org.uk
---
 cmds-check.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index d195e7a..fc84ad8 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2418,6 +2418,7 @@ static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct btrfs_key k1, k2;
 	int i;
+	int level;
 	int ret;
 
 	if (status != BTRFS_TREE_BLOCK_BAD_KEY_ORDER)
@@ -2435,9 +2436,10 @@ static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -EIO;
 
-	path->lowest_level = btrfs_header_level(buf);
+	level = btrfs_header_level(buf);
+	path->lowest_level = level;
 	path->skip_check_block = 1;
-	if (btrfs_header_level(buf))
+	if (level)
 		btrfs_node_key_to_cpu(buf, &k1, 0);
 	else
 		btrfs_item_key_to_cpu(buf, &k1, 0);
@@ -2448,9 +2450,9 @@ static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
 		return -EIO;
 	}
 
-	buf = path->nodes[0];
+	buf = path->nodes[level];
 	for (i = 0; i < btrfs_header_nritems(buf) - 1; i++) {
-		if (btrfs_header_level(buf)) {
+		if (level) {
 			btrfs_node_key_to_cpu(buf, &k1, i);
 			btrfs_node_key_to_cpu(buf, &k2, i + 1);
 		} else {
-- 
1.9.2

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] btrfs check: Pre-sort keys in a block while searching

2014-05-05 Thread Hugo Mills
When we think we might have a messed-up block with keys out of order
(e.g. during fsck), we still need to be able to find a key in the block.
To deal with this, we copy the keys, keeping track of where they came from
in the original node/leaf, sort them, and then do the binary search.

Signed-off-by: Hugo Mills h...@carfax.org.uk
---
 cmds-check.c |  1 +
 ctree.c  | 86 ++--
 ctree.h  |  2 ++
 3 files changed, 75 insertions(+), 14 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index fc84ad8..b2e4a46 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2439,6 +2439,7 @@ static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
 	level = btrfs_header_level(buf);
 	path->lowest_level = level;
 	path->skip_check_block = 1;
+	path->bin_search_presort = 1;
 	if (level)
 		btrfs_node_key_to_cpu(buf, &k1, 0);
 	else
diff --git a/ctree.c b/ctree.c
index 9e5b30f..30e1785 100644
--- a/ctree.c
+++ b/ctree.c
@@ -388,6 +388,16 @@ int btrfs_comp_cpu_keys(struct btrfs_key *k1, struct btrfs_key *k2)
 	return 0;
 }
 
+int btrfs_comp_disk_keys(struct btrfs_disk_key *dk1,
+			 struct btrfs_disk_key *dk2)
+{
+	struct btrfs_key k1, k2;
+
+	btrfs_disk_key_to_cpu(&k1, dk1);
+	btrfs_disk_key_to_cpu(&k2, dk2);
+	return btrfs_comp_cpu_keys(&k1, &k2);
+}
+
 /*
  * compare two keys in a memcmp fashion
  */
@@ -598,25 +608,73 @@ static int generic_bin_search(struct extent_buffer *eb, unsigned long p,
 	return 1;
 }
 
+static int cmp_disk_keys(const void *k1, const void *k2)
+{
+	return btrfs_comp_disk_keys((struct btrfs_disk_key *)k1, (struct btrfs_disk_key *)k2);
+}
+
+/* Copy the item keys and their original positions into a second
+ * extent buffer, which can be safely passed to generic_bin_search in
+ * the case where the keys might be out of order.
+ */
+static void sort_key_copy(struct extent_buffer *tgt, struct extent_buffer *src,
+			  int offset, int item_size, int nitems)
+{
+	struct btrfs_disk_key *src_item;
+	struct btrfs_item *tgt_item;
+	int i;
+
+	for (i = 0; i < nitems; i++) {
+		/* We abuse the struct btrfs_item slightly here: the key
+		 * is the key we care about; the offset field is the
+		 * original slot number */
+		src_item = (struct btrfs_disk_key *)(src->data + offset + i*item_size);
+		tgt_item = (struct btrfs_item *)(tgt->data + i*sizeof(struct btrfs_item));
+		memcpy(tgt_item, src_item, sizeof(struct btrfs_disk_key));
+		tgt_item->offset = i;
+	}
+	qsort(tgt->data, nitems, sizeof(struct btrfs_item), cmp_disk_keys);
+}
+
 /*
  * simple bin_search frontend that does the right thing for
  * leaves vs nodes
  */
 static int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
- int level, int *slot)
+ int level, int pre_sort, int *slot)
 {
-   if (level == 0)
-   return generic_bin_search(eb,
- offsetof(struct btrfs_leaf, items),
- sizeof(struct btrfs_item),
- key, btrfs_header_nritems(eb),
- slot);
-   else
-   return generic_bin_search(eb,
- offsetof(struct btrfs_node, ptrs),
- sizeof(struct btrfs_key_ptr),
- key, btrfs_header_nritems(eb),
- slot);
+   struct extent_buffer *sorted = NULL;
+   int ret;
+   int offset, size, nritems;
+
+   if (level == 0) {
+   offset = offsetof(struct btrfs_leaf, items);
+   size = sizeof(struct btrfs_item);
+   } else {
+   offset = offsetof(struct btrfs_node, ptrs);
+   size = sizeof(struct btrfs_key_ptr);
+   }
+   nritems = btrfs_header_nritems(eb);
+
+   if (pre_sort) {
+		sorted = alloc_extent_buffer(eb->tree, eb->dev_bytenr, eb->len);
+   sort_key_copy(sorted, eb, offset, size, nritems);
+   offset = 0;
+   size = sizeof(struct btrfs_item);
+   eb = sorted;
+   }
+
+   ret = generic_bin_search(eb, offset, size, key, nritems, slot);
+
+   if (pre_sort) {
+   /* We have the sorted slot number, which is probably unhelpful
+  if the sort changed the order. So, return the original slot
+  number, not the sorted position. */
+		*slot = ((struct btrfs_item *)(eb->data + (*slot)*size))->offset;
+   free_extent_buffer(sorted);
+   }
+
+   return ret;
 }
 
 struct extent_buffer *read_node_slot(struct btrfs_root *root,
@@ -1075,7 +1133,7 @@ again

[PATCH 3/3] btrfs check: Attempt to fix misordered keys with bitflips in them

2014-05-05 Thread Hugo Mills
If someone has had bad RAM which has been used to store a metadata block,
there's a chance that one or more of the keys has had a bit changed. The
block checksum doesn't help us here, because it's made on the bad data.

To fix this, if there's a block with a bad key order in it, we find out-of-
order keys by bracketing them between the good keys either side of them,
and then attempting to flip each bit of the key in turn.

If precisely one of those bitflips puts the broken key back into order
relative to its two neighbours, we probably have a fix for the bitflip,
and so we write it back to the FS.

This doesn't repair bitflipped keys at the start or end of a metadata
block, nor bitflips in any other data structure.

Signed-off-by: Hugo Mills h...@carfax.org.uk
---
 cmds-check.c | 103 ---
 1 file changed, 99 insertions(+), 4 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index b2e4a46..c93fea3 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2406,8 +2406,9 @@ static int swap_values(struct btrfs_root *root, struct btrfs_path *path,
 }
 
 /*
- * Attempt to fix basic block failures.  Currently we only handle bad key
- * orders, we will cycle through the keys and swap them if necessary.
+ * Attempt to fix basic block failures. Currently we only handle bad
+ * key orders, we will look for fixable bitflips, and also cycle
+ * through the keys and swap them if necessary.
  */
 static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
@@ -2416,8 +2417,9 @@ static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
 				enum btrfs_tree_block_status status)
 {
struct btrfs_path *path;
-   struct btrfs_key k1, k2;
-   int i;
+   struct btrfs_key k1, k2, k3;
+   int i, j;
+   int bit, field;
int level;
int ret;
 
@@ -2452,6 +2454,99 @@ static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
 	}
 
 	buf = path->nodes[level];
+
+	/* First, look for bitflips in keys: we identify these where k1 <
+	 * k3 but k1 >= k2 or k2 >= k3. We can fix a bitflip if there's
+	 * exactly one bit that we can flip that makes k1 < k2 < k3. */
+	for (i = 0; i < btrfs_header_nritems(buf) - 2; i++) {
+		if (level) {
+			btrfs_node_key_to_cpu(buf, &k1, i);
+			btrfs_node_key_to_cpu(buf, &k2, i+1);
+			btrfs_node_key_to_cpu(buf, &k3, i+2);
+		} else {
+			btrfs_item_key_to_cpu(buf, &k1, i);
+			btrfs_item_key_to_cpu(buf, &k2, i+1);
+			btrfs_item_key_to_cpu(buf, &k3, i+2);
+		}
+
+		if (btrfs_comp_cpu_keys(&k1, &k3) >= 0)
+			continue; /* Bracketing keys compare incorrectly:
+				     we can't fix this */
+		if (btrfs_comp_cpu_keys(&k1, &k2) <= 0 &&
+		    btrfs_comp_cpu_keys(&k2, &k3) <= 0)
+			continue; /* All three keys are in order: nothing to do */
+
+		bit = -1;
+		field = -1;
+		for(j = 0; j < 64; j++) {
+			/* Look for flipped/fixable bits in the objectid */
+			k2.objectid ^= 0x1ULL << j;
+			if (btrfs_comp_cpu_keys(&k1, &k2) <= 0 &&
+			    btrfs_comp_cpu_keys(&k2, &k3) <= 0) {
+				/* Do nothing if we've already found a flippable bit:
+				 * multiple solutions means we can't know what the
+				 * right thing to do is */
+				if (field != -1) {
+					field = -1;
+					break;
+				}
+				bit = j;
+				field = 0;
+			}
+			k2.objectid ^= 0x1ULL << j;
+
+			/* Look for flipped/fixable bits in the type */
+			if (j < 8) {
+				k2.type ^= 0x1ULL << j;
+				if (btrfs_comp_cpu_keys(&k1, &k2) <= 0 &&
+				    btrfs_comp_cpu_keys(&k2, &k3) <= 0) {
+					if (field != -1) {
+						field = -1;
+						break;
+					}
+					bit = j;
+					field = 1;
+				}
+				k2.type ^= 0x1ULL << j;
+			}
+
+			/* Look for flipped/fixable bits in the offset */
+			k2.offset ^= 0x1ULL << j

[[PATCH] 0/3] btrfs-check: Fix bitflipped keys from bad RAM

2014-05-05 Thread Hugo Mills
If you have RAM with stuck or unreliable bits in it, and a metadata block is
stored in it, you can end up with keys with errors in. These usually show up
as bad key order. In many cases, these out-of-order keys can be identified
and fixed with a simple heuristic. This patch series implements that
heuristic, and fixes a long-standing issue with the existing code to fix out-
of-order keys.

Hugo Mills (3):
  btrfs check: Fix wrong level access
  btrfs check: Pre-sort keys in a block while searching
  btrfs check: Attempt to fix misordered keys with bitflips in them

 cmds-check.c | 114 ++-
 ctree.c  |  86 
 ctree.h  |   2 ++
 3 files changed, 180 insertions(+), 22 deletions(-)

-- 
1.9.2

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Thoughts on RAID nomenclature

2014-05-05 Thread Hugo Mills
   A passing remark I made on this list a day or two ago set me to
thinking. You may all want to hide behind your desks or in a similar
safe place away from the danger zone (say, Vladivostok) at this
point...

   If we switch to the NcMsPp notation for replication, that
comfortably describes most of the plausible replication methods, and
I'm happy with that. But, there's a wart in the previous proposition,
which is putting d for 2cd to indicate that there's a DUP where
replicated chunks can go on the same device. This was the jumping-off
point to consider chunk allocation strategies in general.

   At the moment, we have two chunk allocation strategies: dup and
spread (for want of a better word; not to be confused with the
ssd_spread mount option, which is a whole different kettle of
borscht). The dup allocation strategy is currently only available for
2c replication, and only on single-device filesystems. When a
filesystem with dup allocation has a second device added to it, it's
automatically upgraded to spread.

   The general operation of the chunk allocator is that it's asked for
locations for n chunks for a block group, and makes a decision about
where those chunks go. In the case of spread, it sorts the devices in
decreasing order of unchunked space, and allocates the n chunks in
that order. For dup, it allocates both chunks on the same device (or,
generalising, may allocate the chunks on the same device if it has
to).

   Now, there are other variations we could consider. For example:

 - linear, which allocates on the n smallest-numbered devices with
   free space. This goes halfway towards some people's goal of
   minimising the file fragments damaged in a device failure on a 1c
   FS (again, see (*)). [There's an open question on this one about
   what happens when holes open up through, say, a balance.]

 - grouped, which allows the administrator to assign groups to the
   devices, and allocates each chunk from a different group. [There's
   a variation here -- we could look instead at ensuring that
   different _copies_ go in different groups.]

   Given these four (spread, dup, linear, grouped), I think it's
fairly obvious that spread is a special case of grouped, where each
device is its own group. Then dup is the opposite of grouped (i.e. you
must have one or the other but not both). Finally, linear is a
modifier that changes the sort order.

   All of these options run completely independently of the actual
replication level selected, so we could have 3c:spread,linear
(allocates on the first three devices only, until one fills up and
then it moves to the fourth device), or 2c2s:grouped, with a device
mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
different copies on different device controllers.

   Does this all make sense? Are there any other options or features
that we might consider for chunk allocation at this point? Having had
a look at the chunk allocator, I think most if not all of this is
fairly easily implementable, given a sufficiently good method of
describing it all, which is what I'm trying to get to the bottom of in
this discussion.

   Hugo.

(*) The missing piece here is to deal with extent allocation in a
similar way, which would offer better odds again on the number of
files damaged in a device-loss situation on a 1c FS. This is in
general a much harder problem, though. The only change we have in this
area at the moment is ssd_spread, which doesn't do very much. It also
has the potential for really killing performance and/or file
fragmentation.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- ...  one ping(1) to rule them all, and in the ---  
 darkness bind(2) them.  


signature.asc
Description: Digital signature


Re: Btrfs raid allocator

2014-05-06 Thread Hugo Mills
On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
 Hello all!
 
 I would like to use btrfs (or anything else actually) to maximize raid0
 performance. Basically I have a relatively constant stream of data that
 simply has to be written out to disk. So my question is, how is the block
 allocator deciding on which device to write, can this decision be dynamic
 and could it incorporate timing/throughput decisions? I'm willing to write
 code, I just have no clue as to how this works right now. I read somewhere
 that the decision is based on free space, is this still true?

   For (current) RAID-0 allocation, the block group allocator will use
as many chunks as there are devices with free space (down to a minimum
of 2). Data is then striped across those chunks in 64 KiB stripes.
Thus, the first block group will be N GiB of usable space, striped
across N devices.
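
   As a toy illustration of that striping (a sketch only -- it assumes a
fixed 64 KiB stripe element and simple round-robin placement, and glosses
over the real chunk/extent mapping):

STRIPE = 64 * 1024   # 64 KiB stripe elements

def stripe_location(logical, ndevs):
    # Map a logical byte offset within a RAID-0 block group to
    # (device index, byte offset within that device's chunk).
    element = logical // STRIPE       # which 64 KiB element
    dev = element % ndevs             # round-robin across the devices
    row = element // ndevs            # which full stripe row
    return dev, row * STRIPE + logical % STRIPE

# The first four elements of a 4-device stripe touch every device once:
for off in range(0, 256 * 1024, STRIPE):
    print(off, stripe_location(off, 4))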

   There's a second level of allocation (which I haven't looked at at
all), which is how the FS decides where to put data within the
allocated block groups. I think it will almost certainly be beneficial
in your case to use prealloc extents, which will turn your continuous
write into large contiguous sections of striping.
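
   If you want to experiment with preallocation from userspace, here's a
minimal sketch using os.posix_fallocate (the path and size are just
made-up examples; whether it helps will depend on your actual write
pattern):

import os

def preallocate(path, size):
    # Reserve size bytes up front, so the later streaming writes fill
    # in extents that were allocated in one go.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)

preallocate("/mnt/capture/stream.bin", 10 * 1024**3)   # reserve 10 GiB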

   I would recommend thoroughly benchmarking your application with the
FS first though, just to see how it's going to behave for you.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Ceci n'est pas une pipe:  | ---   


signature.asc
Description: Digital signature


Re: Btrfs raid allocator

2014-05-06 Thread Hugo Mills
On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:
 On 06.05.2014 12:59, Hugo Mills wrote:
 On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
 Hello all!
 
 I would like to use btrfs (or anything else actually) to maximize raid0
 performance. Basically I have a relatively constant stream of data that
 simply has to be written out to disk. So my question is, how is the block
 allocator deciding on which device to write, can this decision be dynamic
 and could it incorporate timing/throughput decisions? I'm willing to write
 code, I just have no clue as to how this works right now. I read somewhere
 that the decision is based on free space, is this still true?
 
 For (current) RAID-0 allocation, the block group allocator will use
 as many chunks as there are devices with free space (down to a minimum
 of 2). Data is then striped across those chunks in 64 KiB stripes.
 Thus, the first block group will be N GiB of usable space, striped
 across N devices.
 
 So do I understand this correctly that (assuming we have enough space) data
 will be spread equally between the disks independent of write speeds? So one
 slow device would slow down the whole raid?

   Yes. Exactly the same as it would be with DM RAID-0 on the same
configuration. There's not a lot we can do about that at this point.

 There's a second level of allocation (which I haven't looked at at
 all), which is how the FS decides where to put data within the
 allocated block groups. I think it will almost certainly be beneficial
 in your case to use prealloc extents, which will turn your continuous
 write into large contiguous sections of striping.
 
 Why does prealloc change anything? For me latency does not matter, only
 continuous throughput!

   It makes the extent allocation algorithm much simpler, because it
can then allocate in larger chunks and do more linear writes.

 I would recommend thoroughly benchmarking your application with the
 FS first though, just to see how it's going to behave for you.
 
 Hugo.
 
 
 Of course - it's just that I do not yet have the hardware, but I plan to
 test with a small model - I'm just trying to find out how it actually works
 first, so I know what to look out for.

   Good luck. :)

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- I am the author. You are the audience. I outrank you! --- 


signature.asc
Description: Digital signature


Re: Btrfs raid allocator

2014-05-06 Thread Hugo Mills
On Tue, May 06, 2014 at 01:26:44PM +0200, Hendrik Siedelmann wrote:
 On 06.05.2014 13:19, Hugo Mills wrote:
 On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:
 On 06.05.2014 12:59, Hugo Mills wrote:
 On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
 Hello all!
 
 I would like to use btrfs (or anything else actually) to maximize raid0
 performance. Basically I have a relatively constant stream of data that
 simply has to be written out to disk. So my question is, how is the block
 allocator deciding on which device to write, can this decision be dynamic
 and could it incorporate timing/throughput decisions? I'm willing to write
 code, I just have no clue as to how this works right now. I read somewhere
 that the decision is based on free space, is this still true?
 
 For (current) RAID-0 allocation, the block group allocator will use
 as many chunks as there are devices with free space (down to a minimum
 of 2). Data is then striped across those chunks in 64 KiB stripes.
 Thus, the first block group will be N GiB of usable space, striped
 across N devices.
 
 So do I understand this correctly that (assuming we have enough space) data
 will be spread equally between the disks independent of write speeds? So one
 slow device would slow down the whole raid?
 
 Yes. Exactly the same as it would be with DM RAID-0 on the same
 configuration. There's not a lot we can do about that at this point.
 
 So striping is fixed but which disk takes part with a chunk is dynamic? But
 for large workloads slower disks could 'skip a chunk' as chunk allocation is
 dynamic, correct?

   You'd have to rewrite the chunk allocator to do this, _and_ provide
different RAID levels for different subvolumes. The chunk/block group
allocator right now uses only one rule for allocating data, and one
for allocating metadata. Now, both of these are planned, and _might_
between them possibly cover the use-case you're talking about, but I'm
not certain it's necessarily a sensible thing to do in this case.

   My question is, if you actually care about the performance of this
system, why are you buying some slow devices to drag the performance
of your fast devices down? It seems like a recipe for disaster...

 There's a second level of allocation (which I haven't looked at at
 all), which is how the FS decides where to put data within the
 allocated block groups. I think it will almost certainly be beneficial
 in your case to use prealloc extents, which will turn your continuous
 write into large contiguous sections of striping.
 
 Why does prealloc change anything? For me latency does not matter, only
 continuous throughput!
 
 It makes the extent allocation algorithm much simpler, because it
 can then allocate in larger chunks and do more linear writes.
 
 Is this still true if I do very large writes? Or do those get broken down by
 the kernel somewhere?

   I guess it'll depend on the approach you use to do these very
large writes, and on the exact definition of very large. This is
not an area I know a huge amount about.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- I am the author. You are the audience. I outrank you! --- 


signature.asc
Description: Digital signature


Re: Please review and comment, dealing with btrfs full issues

2014-05-06 Thread Hugo Mills
On Tue, May 06, 2014 at 06:30:31PM +0200, Brendan Hide wrote:
 Hi, Marc. Inline below. :)
 
 On 2014/05/06 02:19 PM, Marc MERLIN wrote:
 On Mon, May 05, 2014 at 07:07:29PM +0200, Brendan Hide wrote:
 In the case above, because the filesystem is only 55% full, I can
 ask balance to rewrite all chunks that are more than 55% full:
 
 legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1
 
 -dusage=50 will balance all chunks that are 50% *or less* used,
 Sorry, I actually meant to write 55 there.
 
 not more. The idea is that full chunks are better left alone while
 emptyish chunks are bundled together to make new full chunks,
 leaving big open areas for new chunks. Your process is good however
 - just the explanation that needs the tweak. :)
 Mmmh, so if I'm 55% full, should I actually use -dusage=45 or 55?
 
 As usual, it depends on what end-result you want. Paranoid rebalancing -
 always ensuring there are as many free chunks as possible - is totally
 unnecessary. There may be more good reasons to rebalance - but I'm only
 aware of two: a) to avoid ENOSPC due to running out of free chunks; and b)
 to change allocation type.

   c) its original reason: to redistribute the data on the FS, for
   example in the case of a new device being added or removed.

 If you want all chunks either full or empty (except for that last chunk
 which will be somewhere inbetween), -dusage=55 will get you 99% there.
 In your last example, a full rebalance is not necessary. If you want
 to clear all unnecessary chunks you can run the balance with
 -dusage=80 (636GB/800GB~=79%). That will cause a rebalance only of
 the data chunks that are 80% and less used, which would by necessity
 get about ~160GB worth of chunks back out of data and available for
 re-use.
 So in my case when I hit that case, I had to use dusage=0 to recover.
 Anything above that just didn't work.
 
 I suspect when using more than zero the first chunk it wanted to balance
 wasn't empty - and it had nowhere to put it. Then when you did dusage=0, it
 didn't need a destination for the data. That is actually an interesting
 workaround for that case.

   I've actually looked into implementing a smallest=n filter that
would take only the n least-full chunks (by fraction) and balance
those. However, it's not entirely trivial to do efficiently with the
current filtering code.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Hail and greetings.  We are a flat-pack invasion force from ---   
 Planet Ikea. We come in pieces. 


signature.asc
Description: Digital signature


Re: raid0 vs single, and should we allow -mdup by default on SSDs?

2014-05-07 Thread Hugo Mills
On Wed, May 07, 2014 at 01:18:40AM -0700, Marc MERLIN wrote:
 On Tue, May 06, 2014 at 07:39:12PM +, Duncan wrote:
  That appears to be a very good use of either -d raid0 or -d single, yes.  
  And since you're apparently not streaming such high resolution video that 
  you NEED the raid0, single does indeed give you a somewhat better chance 
  at recovery.
  
 zoneminder saves 'video' as a stream of independent small jpegs, so I'm
 good. Actually come to think of it they're so small that they probably
 all ended up in the raid1 metadata. That also means that I'm not getting
 twice the storage space like I planned to. Oh well...

   There's a mount option to change the threshold at which files are
inlined in metadata: max_inline=<bytes>. You could play with that for
this particular use-case.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- I am but mad north-north-west:  when the wind is southerly, I ---  
   know a hawk from a handsaw.   


signature.asc
Description: Digital signature


Smallest-n balance filter (was Re: Please review and comment, dealing with btrfs full issues)

2014-05-07 Thread Hugo Mills
On Wed, May 07, 2014 at 04:09:27PM +0200, David Sterba wrote:
 On Tue, May 06, 2014 at 05:43:24PM +0100, Hugo Mills wrote:
   So in my case when I hit that case, I had to use dusage=0 to recover.
   Anything above that just didn't work.
   
   I suspect when using more than zero the first chunk it wanted to balance
   wasn't empty - and it had nowhere to put it. Then when you did dusage=0, 
   it
   didn't need a destination for the data. That is actually an interesting
   workaround for that case.
  
 I've actually looked into implementing a smallest=n filter that
  would take only the n least-full chunks (by fraction) and balance
  those. However, it's not entirely trivial to do efficiently with the
  current filtering code.
 
 I've prototyped something similar, to limit the number of balanced
 chunks by a number. To achieve n least-full chunks would be an
 iterative process of increasing the usage filter and limiting the number
 of chunks until the desired N is reached.
 
 N=n
 F=0
 while (N > 0) {
   balance -dusage=F,limit=N
   N -= number of balanced chunks
   F++
 }
 
 The patch is in branch dev/balance-limit in my git repos.
 
 We can then implement the n-least-full as a synthetic filter from
 userspace.

   This is inefficient, because we've got an O(m) pass through all the
chunks for every call. If we reduce the number of calls by increasing
the increment of F (F+=3, say), then we risk overbalancing, or missing
out on smaller chunks we could have balanced earlier. From a practical
point of view, it may make little difference, but the computer
scientist in me is going ew.

   The other method, for small n only, would be to construct the list
first, an O(m log n) operation for a filesystem of size m, requiring
O(n) storage, and then iterate over just those chunks. The problem
with that is the storage requirements, and keeping track of the state
of the list for restart purposes. [actually, there's probably an O(m)
algorithm to get the n smallest items, but those are a bit
complicated]
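
   Just to illustrate the list-building idea (a userspace toy, not btrfs
code -- it assumes you already have some way of enumerating block groups
as (id, used, total) tuples):

import heapq

def n_least_full(chunks, n):
    # chunks: iterable of (chunk_id, used_bytes, total_bytes).
    # Returns the n chunks with the smallest used fraction, in O(m log n).
    return heapq.nsmallest(n, chunks, key=lambda c: c[1] / c[2])

chunks = [("bg1", 900, 1024), ("bg2", 100, 1024), ("bg3", 512, 1024)]
print(n_least_full(chunks, 2))   # bg2 and bg3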

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- A diverse working environment:  Di longer you vork here, di ---   
 verse it gets.  


signature.asc
Description: Digital signature


Re: Thoughts on RAID nomenclature

2014-05-08 Thread Hugo Mills
On Mon, May 05, 2014 at 10:17:38PM +0100, Hugo Mills wrote:
A passing remark I made on this list a day or two ago set me to
 thinking. You may all want to hide behind your desks or in a similar
 safe place away from the danger zone (say, Vladivostok) at this
 point...
 
If we switch to the NcMsPp notation for replication, that
 comfortably describes most of the plausible replication methods, and
 I'm happy with that. But, there's a wart in the previous proposition,
 which is putting d for 2cd to indicate that there's a DUP where
 replicated chunks can go on the same device. This was the jumping-off
 point to consider chunk allocation strategies in general.
 
At the moment, we have two chunk allocation strategies: dup and
 spread (for want of a better word; not to be confused with the
 ssd_spread mount option, which is a whole different kettle of
 borscht). The dup allocation strategy is currently only available for
 2c replication, and only on single-device filesystems. When a
 filesystem with dup allocation has a second device added to it, it's
 automatically upgraded to spread.
 
The general operation of the chunk allocator is that it's asked for
 locations for n chunks for a block group, and makes a decision about
 where those chunks go. In the case of spread, it sorts the devices in
 decreasing order of unchunked space, and allocates the n chunks in
 that order. For dup, it allocates both chunks on the same device (or,
 generalising, may allocate the chunks on the same device if it has
 to).
 
Now, there are other variations we could consider. For example:
 
  - linear, which allocates on the n smallest-numbered devices with
free space. This goes halfway towards some people's goal of
minimising the file fragments damaged in a device failure on a 1c
FS (again, see (*)). [There's an open question on this one about
what happens when holes open up through, say, a balance.]
 
  - grouped, which allows the administrator to assign groups to the
devices, and allocates each chunk from a different group. [There's
a variation here -- we could look instead at ensuring that
different _copies_ go in different groups.]
 
Given these four (spread, dup, linear, grouped), I think it's
 fairly obvious that spread is a special case of grouped, where each
 device is its own group. Then dup is the opposite of grouped (i.e. you
 must have one or the other but not both). Finally, linear is a
 modifier that changes the sort order.
 
All of these options run completely independently of the actual
 replication level selected, so we could have 3c:spread,linear
 (allocates on the first three devices only, until one fills up and
 then it moves to the fourth device), or 2c2s:grouped, with a device
 mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
 different copies on different device controllers.

   Having thought about this some more(*), what I've described above
isn't quite right. We've got two main axes for the algorithm, and one
flag (which modifies the second axis's behaviour).

   The first axis is selection of a suitable device from a list of
candidates. I've renamed things from my last email to try to make
things clearer, but example algorithms here could be:

 - first:   The old algorithm, which simply selects the available device
with the smallest device ID.

 - even:The current algorithm, which selects the available device
with the largest free space.

 - round-robin: A stateful selection, which selects the next available
device after the last one selected.

 - forward: Like first, only if a device becomes full, we don't go
back to look at it until we've exhausted all the other
devices first. This approximates a ring-buffer type
structure (only where we only fill in gaps, obviously).

 - seq: Like first, only with a user-supplied arbitrary ordering of
devices.

   These can be stateful (r-r and forward are), and each function may
be called multiple times for each block group: once per chunk
required. I don't necessarily propose to implement all of these, but
at least even and first, and possibly seq, seem sensible.
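
   To give a flavour of the shape these might take, here's a hypothetical
sketch of a first selector, in the same style as the prototype code
(it assumes device objects with an .id attribute and a .get_first_from()
method; it's illustrative, not an actual implementation):

def seq_first(devs, state):
    # Pick the eligible device with the smallest device ID.
    dev = min(devs, key=lambda d: d.id) if devs else None
    if dev is not None:
        return dev, dev.get_first_from(0), state
    return None, 0, state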

   After selecting a device on which to create a chunk, we then need
to winnow out the devices which are no longer suitable for selection
as a result of that allocation. This is the other axis, and the
behaviours here are:

 - any: The current behaviour. Only the selected device is removed
from consideration.

 - fast, slow: Like any, except that devices which are,
respectively, slow and fast are put in a special don't use
group which prevents them from being allocated at all.

 - grouped: All devices within the group of the selected device are
removed from consideration. This allows us to specify, for
example, that different copies in Nc configurations should go
on different controllers(**). It also allows us to specify that
metadata chunks should only go on specific devices.

Re: Thoughts on RAID nomenclature

2014-05-08 Thread Hugo Mills
On Thu, May 08, 2014 at 04:58:34PM +0100, Hugo Mills wrote:
The first axis is selection of a suitable device from a list of
 candidates. I've renamed things from my last email to try to make
 things clearer, but example algorithms here could be:
 
  - first:   The old algorithm, which simply selects the available device
 with the smallest device ID.
 
  - even:The current algorithm, which selects the available device
 with the largest free space.
 
  - round-robin: A stateful selection, which selects the next available
 device after the last one selected.
 
  - forward: Like first, only if a device becomes full, we don't go
 back to look at it until we've exhausted all the other
 devices first. This approximates a ring-buffer type
 structure (only where we only fill in gaps, obviously).
 
  - seq: Like first, only with a user-supplied arbitrary ordering of
 devices.
 
These can be stateful (r-r and forward are), and each function may
 be called multiple times for each block group: once per chunk
 required. I don't necessarily propose to implement all of these, but
 at least even and first, and possibly seq, seem sensible.
 
After selecting a device on which to create a chunk, we then need
 to winnow out the devices which are no longer suitable for selection
 as a result of that allocation. This is the other axis, and the
 behaviours here are:
 
  - any: The current behaviour. Only the selected device is removed
 from consideration.
 
  - fast, slow: Like any, except that devices which are,
 respectively, slow and fast are put in a special don't use
 group which prevents them from being allocated at all.
 
  - grouped: All devices within the group of the selected device are
 removed from consideration. This allows us to specify, for
 example, that different copies in Nc configurations should go
 on different controllers(**). It also allows us to specify
 that metadata chunks should only go on specific devices.
 
  - dup: may be applied to any of the other winnowing functions, and
 simply forces that function to put its discarded devices to
 the back of the queue of possible devices again, allowing them
 to be reused.
 
So, current (and strongly recommended) behaviour is even,any or
 even,any+dup. The linear allocation which is sometimes requested would
 be first,any -- this is the same as the old allocator from a few years
 ago. Allocating different copies to different controllers might be:
 even,grouped with groups A=sda,sdb,sdc;B=sdd,sde,sdf. Note that any,
 fast and slow are all special cases of grouped. It's worth
 noting, as I've just discovered, that it's possible to configure this
 system to do some _really_ silly suboptimal things with chunk
 allocation.
 
I've got the algorithms for all this coded in python (mostly --
 I've got a couple of things to finish off), to verify that it turns
 into a sane implementation at least. It does.

   ... and here's the code. Usage:

alloc repl allocator [groups] dev_size ...

   Examples:

$ ./alloc 2c even 1000 1000 2000 # Current behaviour RAID-1
$ ./alloc 2c even,any,dup 1000 1000 2000 # Current behaviour DUP
$ ./alloc 2cMs first 100 100 100 200 200 # RAID-10, first-found allocator
$ ./alloc 1c3s1p rr,grouped A=0,1:B=2,3 100 100 100 100

   Allocators are {even,forward,first,rr},{any,grouped},{distinct,dup}.

   If grouped is selected, you must supply a group mapping:
groupid=dev,dev,...:groupid=dev,dev,...:... A groupid of
"." prevents use of that group entirely.

   Device sizes are scaled to a maximum of 24, for display purposes.

   Hugo.

#!/usr/bin/python3

import sys
import itertools

# Device list filters: filter the sequence of eligible devices based
# on the selected device. Return the filtered sequence

# winn_any is the existing algorithm: it disallows a device from
# being used again if it's already been used once
def winn_any(seq, dev, dgrp, dup):
    seq.remove(dev)
    if dup:
        seq.append(dev)
    return seq

# winn_grouped uses group definitions, and disallows a device if any
# device in its group has already been used.
def winn_grouped(seq, dev, dgrp, dup):
    removed = [d for d in seq if dgrp[d.id] == dgrp[dev.id]]
    seq = [d for d in seq if dgrp[d.id] != dgrp[dev.id]]

    if dup:
        seq += removed

    return seq

# Selection algorithms: given a list of devices and a state, return a
# device (or None), and an updated state

# seq_even is the existing algorithm: pick devices with the largest
# amount of free space
def seq_even(devs, state):
    dev = None
    for d in devs:
        if dev is None or d.free > dev.free:
            dev = d

    if dev is not None:
        return dev, dev.get_first_from(0), state
    else:
        return None, 0, state

# seq_rr does a round-robin through the free devices
def seq_rr(devs

Re: destroyed disk in btrfs raid

2014-05-09 Thread Hugo Mills
On Fri, May 09, 2014 at 08:02:45PM +0200, laie wrote:
 Hello!
 
 I've some trouble with my btrfs filesystem. I've lost one backing raid
 device; its LUKS header is overwritten and not restorable.
 
 The lost disk was recently added. 'btrfs filesystem balance' was running for
 some time, but the new device is the smallest in the set.
 
 Data is stored with Raid0, Metadata with Raid1. Degraded mounting works
 fine.
 
 Now I'm looking for a way to tell btrfs to provide me with a list of the
 corrupted files and delete them afterwards. This would be great, because
 otherwise it would take very long to get the data back from slow backups.

   Simple solution: cat every file to /dev/null, and see which ones
fail with an I/O error. With RAID-0 data, losing a device is going to
damage most files, though, so don't necessarily expect much to survive.
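
   If you want to script that, a rough sketch (the mountpoint is just an
example; it reads in 1 MiB blocks to keep memory use down):

import os

def find_damaged(root):
    # Read every regular file under root, and collect the paths that
    # fail with an I/O error.
    damaged = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                with open(path, "rb") as f:
                    while f.read(1024 * 1024):
                        pass
            except OSError:
                damaged.append(path)
    return damaged

for path in find_damaged("/mnt"):
    print(path)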

   Hugo.

 Thanks in advance
 Max
 
 
 btrfs --version
 Btrfs v3.12
 
 btrfs fi show
 Label: userspace  uuid: there is one ;)
 Total devices 3 FS bytes used 27.21TiB
 devid1 size 21.83TiB used 13.03TiB path /dev/dm-3
 devid2 size 16.37TiB used 13.01TiB path /dev/dm-2
 devid3 size 8.19TiB used 4.27TiB path
 
 btrfs fi df /home/
 Data, RAID0: total=30.24TiB, used=27.18TiB
 System, RAID1: total=32.00MiB, used=1.99MiB
 System, single: total=4.00MiB, used=0.00
 Metadata, RAID1: total=32.00GiB, used=31.42GiB
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- UDP jokes: It's OK if no-one gets them. --- 


signature.asc
Description: Digital signature


Re: destroyed disk in btrfs raid

2014-05-09 Thread Hugo Mills
On Fri, May 09, 2014 at 06:58:27PM +0100, Hugo Mills wrote:
 On Fri, May 09, 2014 at 08:02:45PM +0200, laie wrote:
  Now I'm looking for a way to tell btrfs to provide me with a list of the
  corrupted files and delete them afterwards. This would be great, because
  otherwise it would take very long to get the data back from slow backups.
 
Simple solution: cat every file to /dev/null, and see which ones
 fail with an I/O error. With RAID-0 data, losing a device is going to
 damage most files, though, so don't necessarily expect much to survive.

   Actually, you could cat just the first 256 KiB of each file to
/dev/null -- that should be sufficient, because with RAID-0, the
stripe size is 64 KiB, and 256 KiB is therefore 4 stripes, and so
should cover every device... :)

   Hugo.

  btrfs fi show
  Label: userspace  uuid: there is one ;)
  Total devices 3 FS bytes used 27.21TiB
  devid1 size 21.83TiB used 13.03TiB path /dev/dm-3
  devid2 size 16.37TiB used 13.01TiB path /dev/dm-2
  devid3 size 8.19TiB used 4.27TiB path

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- UDP jokes: It's OK if no-one gets them. --- 


signature.asc
Description: Digital signature


Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)

2014-05-09 Thread Hugo Mills
On Fri, May 09, 2014 at 05:42:54PM -0700, Marc MERLIN wrote:
 On Sat, May 10, 2014 at 10:13:43AM +1000, Chris Samuel wrote:
   Right now, I do see:
   legolas:~# cat /proc/sys/kernel/tainted
   512
  
  IIUC that's an array of bit flags, and that value means you've had a 
  previous 
  kernel warning at that point according to:
  
  https://www.kernel.org/doc/Documentation/sysctl/kernel.txt
 
 Yep, I meant to say that I don't have the 'G' now.

   G is actually good, I think. IIRC, it means that everything we've
had up to this point has been under a license where we have the source
available. It's when you load a proprietary module that you get the P
and the G goes away.

 It's likely that vbox did 'G' even if I didn't successfully start it,
 and even if I haven't had problems with it 'till now, it's a possible
 culprit (more details below)

   I think G is actually a default state, and is good.
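
   For reference, the value is just a bitmask, so something like this
sketch will tell you which of the two relevant bits are set (the values
are the ones documented in Documentation/sysctl/kernel.txt -- do check
them there rather than trusting my memory):

TAINT_PROPRIETARY_MODULE = 1    # 'P' -- a proprietary module was loaded
TAINT_WARN = 512                # 'W' -- the kernel issued a warning earlier

with open("/proc/sys/kernel/tainted") as f:
    flags = int(f.read())

print("proprietary module loaded:", bool(flags & TAINT_PROPRIETARY_MODULE))
print("previous kernel warning:  ", bool(flags & TAINT_WARN))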

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I write in C because using pointer arithmetic lets people ---
   know that you're virile. -- Matthew Garrett   


signature.asc
Description: Digital signature


Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)

2014-05-09 Thread Hugo Mills
On Fri, May 09, 2014 at 06:00:50PM -0600, Chris Murphy wrote:
 Well I'm sorta dense, so I only find a complete dmesg useful because
 with storage problems it seems much is due to some other problem
 happening earlier. 

   Life would be so much easier if filesystems didn't store any
persistent state... :)

   The number of people who don't quite get that that's the function
and natural behaviour of a filesystem is... surprising. 

   As in, "Your filesystem got corruption as a result of a bug in some
earlier version. Upgrading to the new version isn't magically going to
make that corruption go away." (Not saying that's what's happened
here, but it's common, and commonly misunderstood).

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- The makers of Steinway pianos would like me to tell you that ---   
  this is a Bechstein.   


signature.asc
Description: Digital signature


Re: known UUID and metadata consistency

2014-05-11 Thread Hugo Mills
On Sun, May 11, 2014 at 02:16:27PM +1000, Russell Coker wrote:
 One of the problems with ReiserFS was that a fsck --rebuild-tree would look 
 through all the disk contents for blocks that appeared to be metadata. A 
 hostile user could create a file in their home directory (or /tmp or anywhere 
 else) that contained ReiserFS metadata which would be linked into the 
 filesystem on a --rebuild-tree, by doing that I created a SUID root file as 
 non-root on a ReiserFS system. Also ReiserFS was inherently unsuitable for 
 storing filesystem images as that could mess up the filesystem.
 
 I believe that BTRFS uses a UUID for each filesystem that is included in 
 every 
 metadata block which will make it very unlikely that two runs of mkfs.btrfs 
 will result in the same UUID.  Therefore there should be no risk of a 
 filesystem image stored in a file messing up a filesystem.
 
 Is it possible for a hostile user to create a file that could get linked in 
 to 
 a filesystem by btrfsck?  /dev/disk/by-uuid/ is world readable so the UUID of 
 the filesystem is not secret.  Apart from the fact that a correct tree (which 
 can be verified by checksums) won't link to data blocks as metadata what 
 assurance do we have that hostile or corrupt data blocks can't be treated as 
 metadata?

   Possible, but very difficult to accomplish in an undetectable
way.

   I think there's two cases here:

a) The chunk tree is whole and consistent
b) We need to (or have been asked to) rebuild the chunk tree from scratch

   In the first case, btrfsck has all the pointers to the chunk tree
block(s) from the superblock, and therefore knows which areas of the
disk(s) are meant to be metadata and which are meant to be data. In
order for a data block to end up being considered by btrfsck, it would
have to be turned into metadata. This is difficult, but not impossible
-- it can happen if it's written as data, the FS is balanced so the
data chunk is moved elsewhere, a metadata chunk is written in place,
and none of the metadata blocks moved into that chunk overwrites the
evil block. _Then_ btrfsck would have to be scanning metadata for tree
blocks, which it generally doesn't do, I think.

   In order for this to be dangerous, we would then need to have
something pointing at this evil block (which in a good filesystem it
wouldn't), or to have some detailed scan pick it up and attempt to
incorporate it. This means that it's going to have to have things like
the correct (or nearly-correct) transid, and it's got to have good
pointers to other metadata blocks -- a task complicated by the fact
that in order to get to this point, we've just balanced the FS which
has moved *everything* around to new addresses.

   In the second case, it's a much simpler case for an attacker -- you
could potentially construct a whole new chunk tree (typically only a
few blocks in size) with a higher transid than the original FS. If the
chunk tree is sufficiently badly-broken that it can't be read, then a
full disk scan (e.g. btrfs-find-root) would turn up the fake as well,
I think. You'd probably have to construct a whole new load of metadata
as well (at least some of the other trees), and ensure that it all
linked together with the right addresses.

   So I think, yes, you can make fake metadata blocks, but you have to
rely on either (a) your fake data being written to exactly the right
place and not overwritten by a balance, or (b) having the chunk tree
broken and the admin looking for a usable set of metadata with
btrfs-find-root and choosing your fake (with enough plausible complex
data in it to fool them).

   I've probably missed loads of points here, but I think this is a
good start for the conversation. :)

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Anyone who claims their cryptographic protocol is secure is ---   
 either a genius or a fool.  Given the genius/fool ratio 
 for our species,  the odds aren't good. 


signature.asc
Description: Digital signature


Re: RAID10 across different sized disks shows data layout as single not RAID10

2014-05-11 Thread Hugo Mills
On Sun, May 11, 2014 at 05:53:40PM +1000, brett.k...@commandict.com.au wrote:
 Hi,

 I created a RAID10 array of 4x 4TB disks and later added another 4x
 3TB disks, expecting the result to be the same level of fault
 tolerance however with simply more capacity. Recently I noticed the
 output of 'btrfs fi df' lists the Data layout as 'single' and not
 RAID10 per my initial mkfs.btrfs -d raid10 -m raid10 /dev/...
 command.

   That's odd. Was it fully RAID-10 before you added the other
devices? Looking at the btrfs fi df output, there's no vestigial
single chunks for your metadata, so it's been balanced at least
once. What can happen is that if the FS is balanced when new (i.e.
with no data in the data chunk -- so touch foo isn't sufficient),
the data chunk(s) are removed because there's no data in them. With no
data chunks at all, the FS then can't guess what type it should be
using, and falls back to single.

 Is this single data layout due to the overall inconsistent disk size
 used ? e.g. it can no longer fully stripe across all disks hence
 simply concatenates the subsequent smaller disks and displays this
 as an overall 'single' Data layout.

   No, it should be fine. With a balanced RAID-10 in your case, it
will fill up all 8 devices equally, until the smaller ones are full,
and then drop from 8 devices per stripe to 4, and continue to fill up
the remaining devices.

 I require fault tolerance hence ultimately want to know if I
 actually do have a RAID10 data layout, else should try perhaps a
 'btrfs fi balance start -dconvert=raid10 /export' (assuming enough
 free space exists).

   Yes, that would be the thing to do. Note that you'll be _very_
close to full (if not actually full) after doing that, based on the
figures you've quoted below. You have 4*3.64 + 4*2.73 = 25.48 TiB of
raw space, which works out as 12.74 TiB of usable space under RAID-10,
so you're within 100 GiB of full. I'd suggest, if you can, shifting
100 GiB or so of data off to somewhere else temporarily while the
balance runs, just in case.
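
   For reference, the arithmetic above as a quick sketch you can adapt
(it only holds for layouts like this one, where the space pairs up
cleanly under the two-copy rule):

devices_tib = [3.64] * 4 + [2.73] * 4    # raw sizes from btrfs fi show
raw = sum(devices_tib)                   # 25.48 TiB of raw space
usable_raid10 = raw / 2                  # two copies of everything
print(round(raw, 2), round(usable_raid10, 2))   # 25.48 12.74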

 I also noticed there are 2x System layouts shown, which leads me to
 think perhaps the first disks (4x4TB) are laid out as RAID10 for
 Data however the subsequent disks (4x3TB) are simply concatenated,
 giving me hopefully a limited level of fault tolerance for now.

   You'll note that the System/single is empty -- this is left over
from the mkfs process. There would originally have been similar small
empty chunks for Data and Metadata, but these will have gone away on
the first balance.

   As it stands, though, your data is not fault tolerant at all -- but
you're in with a good chance of recovering quite a lot of it if one
disk fails.

   Hugo.

 [root@array ~]# uname -a
 Linux array.commandict.com.au 3.14.2-200.fc20.x86_64 #1 SMP Mon Apr 28 
 14:40:57 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 [root@array ~]# btrfs --version
 Btrfs v3.12
 [root@array ~]# btrfs fi show
 Label: export  uuid: 22c7663a-93ca-40a6-9491-26abaa62b924
 Total devices 8 FS bytes used 12.66TiB
 devid1 size 3.64TiB used 2.12TiB path /dev/sda
 devid2 size 3.64TiB used 2.12TiB path /dev/sde
 devid3 size 3.64TiB used 2.12TiB path /dev/sdi
 devid4 size 3.64TiB used 2.12TiB path /dev/sdg
 devid5 size 2.73TiB used 1.21TiB path /dev/sdb
 devid6 size 2.73TiB used 1.21TiB path /dev/sdf
 devid7 size 2.73TiB used 1.21TiB path /dev/sdh
 devid8 size 2.73TiB used 1.21TiB path /dev/sdj
 
 Btrfs v3.12
 [root@array ~]# btrfs fi df /export
 Data, single: total=13.25TiB, used=12.65TiB
 System, RAID10: total=64.00MiB, used=1.41MiB
 System, single: total=4.00MiB, used=0.00
 Metadata, RAID10: total=19.00GiB, used=16.47GiB
 [root@array ~]#
 
 Thanks in advance,
 Brett.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- The glass is neither half-full nor half-empty; it is twice as ---  
large as it needs to be. 


signature.asc
Description: Digital signature


Re: destroyed disk in btrfs raid

2014-05-11 Thread Hugo Mills
On Tue, May 13, 2014 at 10:16:59AM +0200, laie wrote:
 On 2014-05-09 20:01, Hugo Mills wrote:
 On Fri, May 09, 2014 at 06:58:27PM +0100, Hugo Mills wrote:
 On Fri, May 09, 2014 at 08:02:45PM +0200, laie wrote:
  Now I'm looking for a way to tell btrfs to provide me with a list of the
  corrupted files and delete them afterwards. This would be great, because
  otherwise it would take very long to get the data back from slow backups.
 
Simple solution: cat every file to /dev/null, and see which ones
 fail with an I/O error. With RAID-0 data, losing a device is going to
 damage most files, though, so don't necessarily expect much to survive.
 
Actually, you could cat just the first 256 KiB of each file to
 /dev/null -- that should be sufficient, because with RAID-0, the
 stripe size is 64 KiB, and 256 KiB is therefore 4 stripes, and so
 should cover every device... :)
 
 I was hoping for an internal tool, but I guess the matter is too specific.
 Thank you for the fast and easy solution.
 
 Are you sure that reading only the first 256 KiB also covers huge files, even
 if only the last part is missing?

   Aah, thinking about it harder, since you had a partial balance,
there will be some block groups that are on n-1 devices, and some that
are on n devices. This means that you could have some files with parts
on unbalanced block groups (so, probably safe) and parts on balanced
block groups (probably damaged).

   So go back to my earlier suggestion: cat every file completely.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- The trouble with you, Ibid, is you think you know everything. ---  
 


signature.asc
Description: Digital signature


Re: Error in btrfs wiki - How much space will I get with my multi-device configuration?

2014-05-14 Thread Hugo Mills
On Wed, May 14, 2014 at 09:49:48AM +0100, Astro Xe wrote:
 The content of the FAQ How much space will I get with my multi-device 
 configuration? 
 (https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_space_will_I_get_with_my_multi-device_configuration.3F)
  is currently wrong. The usable space is the sum of the space of the devices. 
 I'm using multi-device btrfs (data: single, metadata: single or DUP) on 
 kernels from 3.2 to 3.14 and I have never seen the behavior described in the 
 answer.

   It's correct, but for the *previous* question. I'm not quite sure
how that got like that, but I've fixed it now.

   Hugo.

 Please remove this question.
 
 
 Thanks,
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Keming (n.) The result of poor kerning ---  


signature.asc
Description: Digital signature


Re: destroyed disk in btrfs raid

2014-05-14 Thread Hugo Mills
On Wed, May 14, 2014 at 08:43:41PM +0200, laie wrote:
 On 2014-05-11 16:19, Hugo Mills wrote:
 On Tue, May 13, 2014 at 10:16:59AM +0200, laie wrote:
 On 2014-05-09 20:01, Hugo Mills wrote:
 On Fri, May 09, 2014 at 06:58:27PM +0100, Hugo Mills wrote:
 On Fri, May 09, 2014 at 08:02:45PM +0200, laie wrote:
  Now I'm looking for a way to tell btrfs to provide me with a list of the
  corrupted files and delete them afterwards. This would be great, because
  otherwise it would take very long to get the data back from slow 
  backups.
 
Simple solution: cat every file to /dev/null, and see which ones
 fail with an I/O error. With RAID-0 data, losing a device is going to
 damage most files, though, so don't necessarily expect much to survive.
 
 I finished building the list; about 40% of the data is gone. So far so good.
 
 As next step I planned to delete these files. This is not possible because
 I'm not able to mount the fs r/w.
 
 btrfs: allowing degraded mounts
 btrfs: bdev /dev/mapper/luks-0 errs: wr 37519, rd 32783, flush 0, corrupt 0,
 gen 0
 Btrfs: too many missing devices, writeable mount is not allowed
 btrfs: open_ctree failed
 
 Is it correct remove the missing device now:
 
 btrfs device delete missing /mnt
 
 Or do I have to add the replacement first?

   You'd have to mount r/w before you can add a new disk. :)

   You should be able to mount r/w using the -o degraded mount option.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Attempted murder, now honestly, what is that?  Do they give a ---  
  Nobel Prize for attempted chemistry?   


signature.asc
Description: Digital signature


Re: staggered stripes

2014-05-15 Thread Hugo Mills
On Thu, May 15, 2014 at 07:00:10PM +1000, Russell Coker wrote:
 http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.html
 
 Page 13 of the above paper says:
 
 # Figure 12 presents for each block number, the number of disk drives of disk
 # model ‘E-1’ that developed a checksum mismatch at that block number. We see
 # in the figure that many disks develop corruption for a specific set of block
 # numbers. We also verified that (i) other disk models did not develop
 # multiple check-sum mismatches for the same set of block numbers (ii) the
 # disks that developed mismatches at the same block numbers belong to
 # different storage systems, and (iii) our software stack has no specific data
 # structure that is placed at the block numbers of interest.
 #
 # These observations indicate that hardware or firmware bugs that affect
 # specific sets of block numbers might exist. Therefore, RAID system designers
 # may be well-advised to use staggered stripes such that the blocks that form
 # a stripe (providing the required redundancy) are placed at different block
 # numbers on different disks.
 
 Does the BTRFS RAID functionality do such staggered stripes?  If not could it 
 be added?

   Yes, it could, by simply shifting around the chunk locations at
allocation time. I'm working in this area at the moment, and I think
it should be feasible within the scope of what I'm doing. I'll add it
to my list of things to look at.
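
   Purely as an illustration of the idea (this is not how the allocator
behaves today, and a real implementation would have to cope with existing
allocations, holes, and so on), the stagger could be as simple as a
per-device shift of the chunk start addresses:

CHUNK = 1024 * 1024 * 1024    # pretend 1 GiB chunks, for the example

def staggered_start(dev_index, chunk_number, stagger=CHUNK // 4):
    # Hypothetical: shift each device's chunk placement by a
    # device-dependent offset, so that the members of one stripe don't
    # all sit at the same block numbers on every disk.
    return chunk_number * CHUNK + dev_index * stagger

for dev in range(4):
    print(dev, staggered_start(dev, 0))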

   Hugo.

 I guess there's nothing stopping a sysadmin from allocating an unused 
 partition at the start of each disk and using a different size for each disk.  
 But I think it would be best to do this inside the filesystem.
 
 Also this is another reason for having DUP+RAID-1.
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- If you're not part of the solution, you're part --- 
   of the precipitate.


signature.asc
Description: Digital signature


Re: staggered stripes

2014-05-15 Thread Hugo Mills
On Fri, May 16, 2014 at 12:38:04AM +1000, Russell Coker wrote:
 On Thu, 15 May 2014 09:31:42 Duncan wrote:
   Does the BTRFS RAID functionality do such staggered stripes?  If not
   could it be added?
  
  AFAIK nothing like that yet, but it's reasonably likely to be implemented
  later.  N-way-mirroring is roadmapped for next up after raid56
  completion, however.
 
 It's RAID-5/6 when we really need such staggering.  It's a reasonably common 
 configuration choice to use two different brands of disk for a RAID-1 array.  
 As the correlation between parts of the disks with errors only applied to 
 disks of the same make and model (and this is expected due to 
 firmware/manufacturing issues) the people who care about such things on 
 RAID-1 
 have probably already dealt with the issue.
 
  You do mention the partition alternative, but not as I'd do it for such a
  case.  Instead of doing a different sized buffer partition (or using the
  mkfs.btrfs option to start at some offset into the device) on each
  device, I'd simply do multiple partitions and reorder them on each
  device.
 
 If there are multiple partitions on a device then that will probably make 
 performance suck.  Also does BTRFS even allow special treatment of them or 
 will it put two copies from a RAID-10 on the same disk?

   It will do. However, we should be able to fix that with the new
allocator, if I ever get it finished...

   Hugo.

  Tho N-way-mirroring would sure help here too, since if a given
  area around the same address is assumed to be weak on each device, I'd
  sure like greater than the current 2-way-mirroring, even if if I had a
  different filesystem/partition at that spot on each one, since with only
  two-way-mirroring if one copy is assumed to be weak, guess what, you're
  down to only one reasonably reliable copy now, and that's not a good spot
  to be in if that one copy happens to be hit by a cosmic ray or otherwise
  fail checksum, without another reliable copy to fix it since that other
  copy is in the weak area already.
  
  Another alternative would be using something like mdraid's raid10 far
  layout, with btrfs on top of that...
 
 In the copies= option thread Brendan Hide stated that this sort of thing is 
 planned.
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Stick them with the pointy end. --- 


signature.asc
Description: Digital signature


Re: [PATCH 3/3] btrfs check: Attempt to fix misordered keys with bitflips in them

2014-05-16 Thread Hugo Mills
On Fri, May 16, 2014 at 04:22:36PM +0200, David Sterba wrote:
 On Mon, May 05, 2014 at 06:07:51PM +0100, Hugo Mills wrote:
  If precisely one of those bitflips puts the broken key back into order
  relative to its two neighbours, we probably have a fix for the bitflip,
  and so we write it back to the FS.
 
 This sounds safe enough to me.  I'll add the patch to integration but
 before I push it further upstream I'd really like to see the bitflip fix
 in action, so if you already have testing images, please let me know.

   Here's the one I mostly used to test with -- it's a 32 GiB sparse
full filesystem image, with a file full of zeroes in it. It has a
single bitflip in the csum tree created by hand with a hex editor, and
then the csum fixed up afterwards, again by hand.

   Hugo.

[1] http://carfax.org.uk/files/temp/testfs.img.tar.gz

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- I don't care about it works on my machine. We are not --- 
 shipping your machine.  


signature.asc
Description: Digital signature


Re: ditto blocks on ZFS

2014-05-17 Thread Hugo Mills
On Sat, May 17, 2014 at 01:50:52PM +0100, Martin wrote:
 On 16/05/14 04:07, Russell Coker wrote:
  https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
  
  Probably most of you already know about this, but for those of you who 
  haven't 
  the above describes ZFS ditto blocks which is a good feature we need on 
  BTRFS.  The briefest summary is that on top of the RAID redundancy there...
 [... are additional copies of metadata ...]
 
 
 Is that idea not already implemented in effect in btrfs with the way
 that the superblocks are replicated multiple times, ever more times, for
 ever more huge storage devices?

   Superblocks are the smallest part of the metadata. There's a whole
load of metadata that's not in the superblocks that isn't replicated
in this way.

 The one exception is for SSDs whereby there is the excuse that you
 cannot know whether your data is usefully replicated across different
 erase blocks on a single device, and SSDs are not 'that big' anyhow.
 
 
 So... Your idea of replicating metadata multiple times in proportion to
 assumed 'importance' or 'extent of impact if lost' is an interesting
 approach. However, is that appropriate and useful considering the real
 world failure mechanisms that are to be guarded against?
 
 Do you see or measure any real advantage?

   This. How many copies do you actually need? Are there concrete
statistics to show the marginal utility of each additional copy?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- IMPROVE YOUR ORGANISMS!!  -- Subject line of spam email --- 


signature.asc
Description: Digital signature


Re: [PATCH 00/27] Replace the old man page with asciidoc and man page for each btrfs subcommand.

2014-05-17 Thread Hugo Mills
On Wed, Apr 16, 2014 at 07:12:19PM +0200, David Sterba wrote:
 On Wed, Apr 02, 2014 at 04:29:11PM +0800, Qu Wenruo wrote:
  Convert the old btrfs man pages to new asciidoc and split the huge
  btrfs man page into subcommand man page.
 
 I'm merging this patchset into the base series of integration because
 several patches need to update the docs and it's no longer feasible to
 keep it in a separate branch from the patches.

   I've just been poking around in the docs for a completely different
reason, and I think there's a fairly serious problem (well, as serious
as problems get with documentation).

   Take, for example, the format for btrfs fi resize:

'resize' [devid:][+/-]size[gkm]|[devid:]max path::

   Now, this has just thrown away all of the useful markup which
indicates the semantics of the command. The asciidoc renders all of
that text literally and unformatted, making alphasymbolic(*) soup of
the docs. Compare this to the old roff man page:

\fBbtrfs\fP \fBfilesystem resize\fP 
[\fIdevid\fP:][+/\-]\fIsize\fP[gkm]|[\fIdevid\fP:]\fImax path\fP

   This isn't perfect -- we're missing a \fB around the max -- but
it has text in bold(⁑) and italics(⁂) and neither(☃). I've just looked
at some of the other pages, and they've also got similar typographical
problems. This is a lot of fiddly tedious work to get it right, and if
it doesn't get done now in the initial commit, then we're going to end
up with poor examples copied for every new feature or docs update,
making the problem worse before anyone does the work to make it
better.

   Hugo.

(*) Or possibly alphashambolic. :)
(⁑) For literal text
(⁂) For variables that require substitution by the user
(☃) For structural syntax indicators such as [] for optional parts, |
for alternation and ... to indicate optional continuation of a list

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- There isn't a noun that can't be verbed. --- 


signature.asc
Description: Digital signature


Re: [PATCH 00/27] Replace the old man page with asciidoc and man page for each btrfs subcommand.

2014-05-17 Thread Hugo Mills
On Sat, May 17, 2014 at 06:43:15PM +0100, Hugo Mills wrote:
 On Wed, Apr 16, 2014 at 07:12:19PM +0200, David Sterba wrote:
  On Wed, Apr 02, 2014 at 04:29:11PM +0800, Qu Wenruo wrote:
   Convert the old btrfs man pages to new asciidoc and split the huge
   btrfs man page into subcommand man page.
  
  I'm merging this patchset into the base series of integration because
  several patches need to update the docs and it's no longer feasible to
  keep it in a separate branch from the patches.
 
I've just been poking around in the docs for a completely different
 reason, and I think there's a fairly serious problem (well, as serious
 as problems get with documentation).
 
Take, for example, the format for btrfs fi resize:
 
 'resize' [devid:][+/-]size[gkm]|[devid:]max path::
 
Now, this has just thrown away all of the useful markup which
 indicates the semantics of the command. The asciidoc renders all of
 that text literally and unformatted, making alphasymbolic(*) soup of
 the docs. Compare this to the old roff man page:
 
 \fBbtrfs\fP \fBfilesystem resize\fP 
 [\fIdevid\fP:][+/\-]\fIsize\fP[gkm]|[\fIdevid\fP:]\fImax path\fP
 
This isn't perfect -- we're missing a \fB around the max -- but
 it has text in bold(⁑) and italics(⁂) and neither(☃). I've just looked
 at some of the other pages, and they've also got similar typographical
 problems. This is a lot of fiddly tedious work to get it right, and if
 it doesn't get done now in the initial commit, then we're going to end
 up with poor examples copied for every new feature or docs update,
 making the problem worse before anyone does the work to make it
 better.

   Oh, and asciidoc appears to be the most horrible capricious
inconsistent parser in existence. I've just spent 5 minutes getting
this one line of text to do what I want it to:

=
__N__**c**[\[_Mmin_**-**]__Mmax__**s**[_P_**p**]]
=

   I had to run through the list of block quote operators one at a
time in order to find the one I needed for this (the =); it's
still not indenting it correctly on the resulting man page.

   Note also the fun things like the fact that [[]] is special, so you
have to quote the opening part of it -- but if you try quoting the
first [ with a \ you get a literal \[ in the output. You get the right
output from quoting the *second* [ only.

   The Nc can only be italicised and emboldened properly with __ and
** because _ and * require whitespace around them in order to work
(seriously, WTF?). However, we can't be consistent with that in the
_Mmin_**-** because the quoted \[ appears to count as whitespace, so
using __Mmin__ gives us a leading literal _. The closing __ appears to
close the single opening _ correctly in that case, though.

   Seriously, this is meant to be _easy_ to use? I think I'd rather
type docbook by hand than have to struggle with this. Even the troff
macros for man pages are simpler to get right.

   Hugo.

Hugo.
 
 (*) Or possibly alphashambolic. :)
 (⁑) For literal text
 (⁂) For variables that require substitution by the user
 (☃) For structural syntax indicators such as [] for optional parts, |
 for alternation and ... to indicate optional continuation of a list
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Great oxymorons of the world, no. 1: Family Holiday ---   


signature.asc
Description: Digital signature


Re: [PATCH 00/27] Replace the old man page with asciidoc and man page for each btrfs subcommand.

2014-05-18 Thread Hugo Mills
On Sun, May 18, 2014 at 02:51:39PM +0800, Qu Wenruo wrote:
 
  Original Message 
 Subject: Re: [PATCH 00/27] Replace the old man page with asciidoc and man
 page for each btrfs subcommand.
 From: Hugo Mills h...@carfax.org.uk
 To: dste...@suse.cz, Qu Wenruo quwen...@cn.fujitsu.com,
 linux-btrfs@vger.kernel.org, c...@fb.com
 Date: 2014年05月18日 01:43
 On Wed, Apr 16, 2014 at 07:12:19PM +0200, David Sterba wrote:
 On Wed, Apr 02, 2014 at 04:29:11PM +0800, Qu Wenruo wrote:
 Convert the old btrfs man pages to new asciidoc and split the huge
 btrfs man page into subcommand man page.
 I'm merging this patchset into the base series of integration because
 several patches need to update the docs and it's no longer feasible to
 keep it in a separate branch from the patches.
 I've just been poking around in the docs for a completely different
 reason, and I think there's a fairly serious problem (well, as serious
 as problems get with documentation).
 
 Take, for example, the format for btrfs fi resize:
 
 'resize' [devid:][+/-]size[gkm]|[devid:]max path::
 
 Now, this has just thrown away all of the useful markup which
 indicates the semantics of the command. The asciidoc renders all of
 that text literally and unformatted, making alphasymbolic(*) soup of
 the docs. Compare this to the old roff man page:
 
 \fBbtrfs\fP \fBfilesystem resize\fP 
 [\fIdevid\fP:][+/\-]\fIsize\fP[gkm]|[\fIdevid\fP:]\fImax path\fP
 Yes, when I converted the man pages to asciidoc docs, I had already realized
 the problem.
 As mentioned in the first patch, most things, including the Makefile, are
 'stolen' from git,
 which means I also apply the git way of dealing with all the 'useful' markups:
 *just throw them away*.
 
 This isn't perfect -- we're missing a \fB around the max -- but
 it has text in bold(⁑) and italics(⁂) and neither(☃). I've just looked
 at some of the other pages, and they've also got similar typographical
 problems. This is a lot of fiddly tedious work to get it right, and if
 it doesn't get done now in the initial commit, then we're going to end
 up with poor examples copied for every new feature or docs update,
 making the problem worse before anyone does the work to make it
 better.
 As I mentioned above, it's meant to be like this, without extra markup, just
 like git.
 
 I chose asciidoc and the git documentation style for the following purposes:
 
 1) Split up the huge 'btrfs' man page. (Main purpose of the patchset)
 
 The 'btrfs' man page is so huge that the synopsis is several pages long,
 forcing developers to edit the man page twice (once for the synopsis and once
 for the command description).
 This makes editing frustrating and makes it easy to introduce inconsistencies.
 (several synopsis and command descriptions are already inconsistent)

   I don't have a problem with that, but it's irrelevant to my point.

 2) Make the documenation more general purpose. (Why choose asciidoc)
 
 Not only generating man pages, but also html/pdf, much like git.

   I have no objections at all to that either. It's a great idea. But
again, it's orthogonal to my main point here, which is that we've lost
useful semantics, because the markup _is_ part of the meaning of the
document.

 3) Make the original txt more human readable (Why choose git style)
 
 I could use the old markup method, but after doing that I realized that if you
 read a document full of
 markup, the markup has already lost its meaning.
 
 If we use too much markup, there will be other problems:
 3.1) making the synopsis in the original txt too long
 
 This will make it harder for both developers and reviewers to
 edit/review.
 I chose asciidoc to get away from the hard-to-understand groff
 grammar; if these \fX markers are
 just converted to ' or `, IMO the cost is still not reduced.

   I don't think we should be throwing away meaning just because it
makes things shorter in the source.

 3.2) the markup is not so highlighted if everything is highlighted
 
 So I chose to highlight only the really important things, since
 with bold and italic text all over the page, there is no
 difference left between the highlighted and the
 normally formatted text.

   I'm only talking about two things: the syntactic summaries, where
we need to distinguish between literals, placeholders and descriptive
elements like []; and command text where we distinguish between the
descriptive English text and the stuff you type. This second point is
particularly important for us because so many of our commands consist
of recognisable English words strung together, and having the literal
commands shown in a distinctive style (typically a monospace font)
helps enormously with parsing the text when reading at speed.

 Due to the above 3 reasons, I think throwing away all the markup in the
 synopsis is the better choice.

   I disagree strongly.

   Hugo.

 Thanks,
 Qu
 
 Hugo.
 
 (*) Or possibly alphashambolic. :)
 (⁑) For literal text
 (⁂) For variables that require substitution by the user

Re: [PATCH 00/27] Replace the old man page with asciidoc and man page for each btrfs subcommand.

2014-05-18 Thread Hugo Mills
On Sun, May 18, 2014 at 03:04:33PM +0800, Qu Wenruo wrote:
 
  Original Message 
 Subject: Re: [PATCH 00/27] Replace the old man page with asciidoc and man
 page for each btrfs subcommand.
 From: Hugo Mills h...@carfax.org.uk
 To: dste...@suse.cz, Qu Wenruo quwen...@cn.fujitsu.com,
 linux-btrfs@vger.kernel.org, c...@fb.com
 Date: 2014-05-18 02:22
 On Sat, May 17, 2014 at 06:43:15PM +0100, Hugo Mills wrote:
 On Wed, Apr 16, 2014 at 07:12:19PM +0200, David Sterba wrote:
 On Wed, Apr 02, 2014 at 04:29:11PM +0800, Qu Wenruo wrote:
 Convert the old btrfs man pages to new asciidoc and split the huge
 btrfs man page into subcommand man page.
 I'm merging this patchset into the base series of integration because
 several patches need to update the docs and it's no longer feasible to
 keep it in a separate branch from the patches.
 I've just been poking around in the docs for a completely different
 reason, and I think there's a fairly serious problem (well, as serious
 as problems get with documentation).
 
 Take, for example, the format for btrfs fi resize:
 
 'resize' [devid:][+/-]size[gkm]|[devid:]max path::
 
 Now, this has just thrown away all of the useful markup which
 indicates the semantics of the command. The asciidoc renders all of
 that text literally and unformatted, making alphasymbolic(*) soup of
 the docs. Compare this to the old roff man page:
 
 \fBbtrfs\fP \fBfilesystem resize\fP 
 [\fIdevid\fP:][+/\-]\fIsize\fP[gkm]|[\fIdevid\fP:]\fImax path\fP
 
 This isn't perfect -- we're missing a \fB around the max -- but
 it has text in bold(⁑) and italics(⁂) and neither(☃). I've just looked
 at some of the other pages, and they've also got similar typographical
 problems. This is a lot of fiddly tedious work to get it right, and if
 it doesn't get done now in the initial commit, then we're going to end
 up with poor examples copied for every new feature or docs update,
 making the problem worse before anyone does the work to make it
 better.
 Oh, and asciidoc appears to be the most horrible capricious
 inconsistent parser in existence. I've just spent 5 minutes getting
 this one line of text to do what I want it to:
 
 =
 __N__**c**[\[_Mmin_**-**]__Mmax__**s**[_P_**p**]]
 =
 
 I had to run through the list of block quote operators one at a
 time in order to find the one I needed for this (the =); it's
 still not indenting it correctly on the resulting man page.
 
 Note also the fun things like the fact that [[]] is special, so you
 have to quote the opening part of it -- but if you try quoting the
 first [ with a \ you get a literal \[ in the output. You get the right
 output from quoting the *second* [ only.
 I have already encountered problems like that, especially when converting
 'btrfs-resize' related things.
 
 But wait a minute: do we really need the *fascinating* highlighting
 in user documentation?

   Yes, absolutely. The formatting is a part of the _meaning_ of the
documentation. Otherwise you're left guessing as to which pieces of
the string of characters are meant to be there literally, and which
pieces have to be replaced by suitable text, and which pieces are
optional.

 The most important thing is the content, not the format.

   My point is that in this case the formatting _is_ a part of the
content.

 I chose asciidoc to make developers spend less effort on formatting,
 not the opposite.
 From this point of view, I think asciidoc does things considerably well.
 
 Although the problem you mentioned is real, it only affects a small part
 of the documentation;
 compared to the overall benefits, I still consider converting to asciidoc
 worthwhile.

   I'm finding it almost impossible to make it do what I want. I think
in some cases it actually _is_ impossible. This is a truly frustrating
tool that is really not making things simpler, and I can see is going
to lead to even more badly marked up documentation -- simply because
it's too difficult and frustrating to get it right.

 The Nc can only be italicised and emboldened properly with __ and
 ** because _ and * require whitespace around them in order to work
 (seriously, WTF?). However, we can't be consistent with that in the
 _Mmin_**-** because the quoted \[ appears to count as whitespace, so
 using __Mmin__ gives us a leading literal _. The closing __ appears to
 close the single opening _ correctly in that case, though.
 
 Seriously, this is meant to be _easy_ to use? I think I'd rather
 type docbook by hand that have to struggle with this. Even the troff
 macros for man pages are simpler to get right.
  Given the purpose of the documentation and the above explanation, I think the
 answer is *Yes*.

   Only if you don't care about the typography of the resulting document.

 If you really think there is a better choice, I will be very happy
 to listen, but please consider what I mentioned above and the
 previous mail first.

   I don't have any real suggestions

Re: btrfs fi df output is not updated in a timely manner after subvolumes have been deleted

2014-05-20 Thread Hugo Mills
On Tue, May 20, 2014 at 02:50:10PM +0100, Astro Xe wrote:
 
 On my box, the used value in the output of btrfs filesystem df is not
 updated in a timely manner after one or more subvolumes have been
 deleted. I need to execute btrfs filesystem sync in order to update the
 value.
 
 How do I fix this? Or, could someone fix this in btrfs-progs, please?

   I've not tested this, but I think you need one of the two -c/-C
options to btrfs sub del, which perform a synchronous delete. Once the
command returns, you can be assured that the subvolume has actually
been deleted and any extents freed up.

 I suspect the cause is that subvolumes are marked as deleted immediately, but
 cleaned up at a later time. Is this right?

   Correct.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Great oxymorons of the world, no. 10: Business Ethics ---  


signature.asc
Description: Digital signature


Re: problem with degraded boot and systemd

2014-05-20 Thread Hugo Mills
On Wed, May 21, 2014 at 12:00:24AM +0200, Goffredo Baroncelli wrote:
 On 05/19/2014 02:54 AM, Chris Murphy wrote:
  Summary:
  
  It's insufficient to pass rootflags=degraded to get the system root
  to mount when a device is missing. It looks like when a device is
  missing, udev doesn't create the dev-disk-by-uuid linkage that then
  causes systemd to change the device state from dead to plugged. Only
  once plugged, will systemd attempt to mount the volume. This issue
  was brought up on systemd-devel under the subject timed out waiting
  for device dev-disk-by\x2duuid for those who want details.
  
 [...]
  
  I think the key problem is either a limitation of udev, or a problem
  with the existing udev rule, that prevents the link creation for any
  remaining btrfs device. Or maybe it's intentional. But I'm not a udev
  expert. This is the current udev rule:
  
  # cat /usr/lib/udev/rules.d/64-btrfs.rules 
  # do not edit this file, it will be overwritten on update
  
  SUBSYSTEM!="block", GOTO="btrfs_end"
  ACTION=="remove", GOTO="btrfs_end"
  ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"
  
  # let the kernel know about this btrfs filesystem, and check if it is complete
  IMPORT{builtin}="btrfs ready $devnode"
  
  # mark the device as not ready to be used by the system
  ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
  
  LABEL="btrfs_end"
 
 
 The key is the line 
 
    IMPORT{builtin}="btrfs ready $devnode"
  
  This line sets ID_BTRFS_READY=0 if the filesystem is not ready; otherwise it
  sets ID_BTRFS_READY=1 [1].
  The next line 
  
    ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
  
  sets SYSTEMD_READY=0 if the filesystem is not ready, so the plug event
 is not raised to systemd.
 
 This is my understanding.
 
  
  
  How this works with raid:
  
  RAID assembly is separate from filesystem mount. The volume UUID
  isn't available until the RAID is successfully assembled.
  
  On at least Fedora (dracut) systems with the system root on an md
  device, the initramfs contains 30-parse-md.sh which includes a loop
  to check for the volume UUID. If it's not found, the script sleeps
  for 0.5 seconds, and then looks for it again, up to 240 times. If
  it's still not found at attempt 240, then the script executes mdadm
  -R to forcibly run the array with fewer than all devices present
  (degraded assembly). Now the volume UUID exists, udevd creates the
  linkage, systemd picks this up and changes device state from dead to
  plugged, and then executes a normal mount command.
 
  The approximate Btrfs equivalent down the road would be a similar
  initrd script, or maybe a user space daemon, that causes btrfs device
  ready to confirm/deny all devices are present. And after x number of
  failures, then it's issue an equivalent to mdadm -R which right now
  we don't seem to have.
 
 I suggest implementing a mount.btrfs command, which waits for all the
 needed disks until a timeout expires. After this timeout it could try
 a degraded mount until a second timeout. Only then does it fail.
 
 Each time a device appears, the system may start mount.btrfs. Each
 invocation has to test whether there is another instance of mount.btrfs
 related to the same filesystem; if so it exits, otherwise it follows the
 above behavior.

   Don't we already have something approaching this functionality with
btrfs device ready? (i.e. this is exactly what it was designed for).
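
   For what it's worth, here's a very rough sketch of the wait-then-degrade
loop described above, driven by the existing btrfs device ready command.
Everything in it -- the helper's shape, the timeouts, shelling out with
system(), and the assumption that btrfs device ready exits zero once all
devices are present -- is illustrative only, not a proposed interface:

/*
 * Purely illustrative sketch of the wait-then-degrade idea, not a real
 * mount.btrfs: helper name, timeouts and the commands invoked are all
 * assumptions for the example.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dev, *mnt;
	char cmd[512];
	int i;

	if (argc < 3)
		return 1;
	dev = argv[1];		/* e.g. /dev/sda2 */
	mnt = argv[2];		/* mount point */

	/* Wait up to ~30s for all devices of the filesystem to appear;
	 * "btrfs device ready" is assumed to exit 0 once they have. */
	snprintf(cmd, sizeof(cmd), "btrfs device ready %s", dev);
	for (i = 0; i < 60 && system(cmd) != 0; i++)
		usleep(500000);

	/* Try a normal mount first, then fall back to a degraded one. */
	snprintf(cmd, sizeof(cmd), "mount %s %s", dev, mnt);
	if (system(cmd) == 0)
		return 0;
	snprintf(cmd, sizeof(cmd), "mount -o degraded %s %s", dev, mnt);
	return system(cmd) == 0 ? 0 : 1;
}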

   Hugo.

  That equivalent might be a decoupling of degraded as a mount option,
  such that the user space tool deals with degradedness. And the mount
 [...]
  
  Chris Murphy
 G.Baroncelli
 
 [1] 
 http://lists.freedesktop.org/archives/systemd-commits/2012-September/002503.html
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Putting U back in Honor,  Valor, and Trth ---


signature.asc
Description: Digital signature


Re: [PATCH v3] lib: add size unit t/p/e to memparse

2014-06-13 Thread Hugo Mills
On Fri, Jun 13, 2014 at 07:54:44AM +0200, Brendan Hide wrote:
 On 12/06/14 23:15, Andrew Morton wrote:
 On Wed, 2 Apr 2014 16:54:37 +0800 Gui Hecheng guihc.f...@cn.fujitsu.com 
 wrote:
 + * %K (for kibibytes, or 1024 bytes),
 + * %M (for mebibytes, or 1048576 bytes),
 + * %G (for gibibytes, or 1073741824 bytes),
 + * %T (for tebibytes, or 1099511627776 bytes),
 + * %P (for pebibytes, or 1125899906842624 bytes),
 + * %E (for exbibytes, or 1152921504606846976 bytes).
 I'm afraid I find these names quite idiotic - we all know what the
 traditional terms mean so why go and muck with it.
 
 Also, kibibytes sounds like cat food.
 Hi, Andrew
 
 While I agree it sounds like cat food, it seemed like a good opportunity to
 fix a minor issue that is otherwise unlikely to be fixed for a very long
 time. Should we feel uncomfortable with the patch, as is, because of
 language/correctness friction? Pedantry included, the patch is correct. ;)

   Last night, I wrote a very grumpy reply to Andrew. I'm glad I
didn't send it, because Brendan has managed to cover at least one of
my points much more politely than I did.

   My other comment is that TB vs TiB is a 10% difference in the
magnitude of the number, and so the accumulated error is now no longer
small enough to be brushed under the carpet as we all did in days
past. By Andrew's thinking, a 4 TB disk is 3.638 TB in size. I'd say a
4 TB disk is 3.638 TiB in size, and I can be precise (±1GB in the
latter case) with both values.
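
   (For anyone who wants to check that figure, here is the arithmetic spelled
out -- this little snippet is purely illustrative and not from any patch:)

/* Purely illustrative: the decimal-vs-binary conversion spelled out. */
#include <stdio.h>

int main(void)
{
	double bytes = 4e12;	/* a "4 TB" drive, in decimal (SI) units */
	double tib = bytes / (1024.0 * 1024.0 * 1024.0 * 1024.0);

	printf("4 TB = %.3f TiB\n", tib);	/* prints 3.638 */
	return 0;
}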

   Hugo.

PS. Let's just not talk about 1.44 MB floppy disks.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- 2 + 2 = 5,  for sufficiently large values of 2. --- 


signature.asc
Description: Digital signature


Re: Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Hugo Mills
On Sun, Jun 15, 2014 at 11:31:07PM +0200, Martin Steigerwald wrote:
 Am Samstag, 14. Juni 2014, 02:53:20 schrieb Duncan:
   I am reaching the conclusion that fallocate is not the problem. The
   fallocate increase the filesize of about 8MB, which is enough for some
   logging. So it is not called very often.
  
  But... 
  
  If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
  nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
  MiB and write that out.  So far so good as at that point the 8 MiB should 
  be a single extent.  But then, data gets written into 4 KiB blocks of 
  that 8 MiB one at a time, and because btrfs is COW, the new data in the 
  block must be written to a new location.
  
  Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
  block has been rewritten to a new location and is now an extent unto 
  itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
  single 4 KiB block in size.
 
 I always thought that the whole point of fallocate is that it *doesn't* write 
 out anything, but just reserves the space. Thus I don't see how COW can have 
 any adverse effect here.

   Exactly. fallocate, as I understand it, says, I'm going to write
[this much] data at some point soon; you may want to allocate that
space in a contiguous manner right now to make the process more
efficient. The space is not a formal part of the file data and so
doesn't need a CoW operation when it's first written to.
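
   As an aside, a minimal sketch of the preallocate-then-append pattern being
discussed here -- the file name and the use of posix_fallocate() are my own
illustration, not systemd's actual code:

/*
 * Illustrative only: reserve 8 MiB up front, then append small records
 * into the reserved range.  File name and sizes are made up for the
 * example; this is not systemd's code.
 */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("example.journal", O_CREAT | O_WRONLY, 0644);

	if (fd < 0)
		return 1;

	/* Ask the filesystem for 8 MiB of space now; the blocks are
	 * allocated but not yet written. */
	if (posix_fallocate(fd, 0, 8 * 1024 * 1024) != 0) {
		close(fd);
		return 1;
	}

	/* Later writes land inside the reserved range in small (e.g.
	 * 4 KiB) chunks -- which is where the CoW-vs-preallocation
	 * question in this thread comes from. */
	(void)pwrite(fd, "log entry\n", 10, 0);

	close(fd);
	return 0;
}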

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- If the first-ever performance is the première,  is the --- 
  last-ever performance the derrière?   


signature.asc
Description: Digital signature


Re: Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Hugo Mills
 adm 200K Mai 11 10:21 
 user-2012@54803afb1b1d42b387822c56e61bc168-00011c75-0004ddb2be06d876.journal
 -rw-r-+ 1 root systemd-journal 3,8M Jun 11 21:14 user-2012.journal
 -rw-r-+ 1 root systemd-journal 3,6M Mai 26 14:04 
 user-65534@0004fa4c62bf4a71-6b4c53dfc06dd588.journal~
 -rw-r-+ 1 root systemd-journal 3,7M Jun  9 23:26 user-65534.journal
 
 
 
 
 
 
 merkaba:/var/log filefrag syslog*
 syslog: 361 extents found
 syslog.1: 202 extents found
 syslog.2.gz: 1 extent found
 [well sure, cause repacked]
 syslog.3.gz: 1 extent found
 syslog.4.gz: 1 extent found
 syslog.5.gz: 1 extent found
 syslog.6.gz: 1 extent found
 
 merkaba:/var/log ls -lh syslog*
 -rw-r- 1 root adm 4,2M Jun 15 23:39 syslog
 -rw-r- 1 root adm 2,1M Jun 11 16:07 syslog.1
 
 
 
 So we have ten times the extents on some systemd journal files than on
 rsyslog.
 
 
 With BTRFS RAID 1 on SSD with compress=lzo, so the 361 extents of syslog
 may be due to the size limit of extents on compressed BTRFS filesystems.
 
 Anyway, since it is flash, I never bothered about the fragmentation.
 
 Ciao,
 -- 
 Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
 GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- If the first-ever performance is the première,  is the --- 
  last-ever performance the derrière?   


signature.asc
Description: Digital signature


[PATCH] mkfs.btrfs: Fix compilation errors with gcc 4.6

2011-06-26 Thread Hugo Mills
gcc 4.6 complains about several possible use-before-initialise cases
in mkfs, and stops. Fix these by initialising one of the variables in
question, and using the correct error-handling paths for the
remainder.

Signed-off-by: Hugo Mills h...@carfax.org.uk
---
 mkfs.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index 3a87d6e..edd7018 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -750,7 +750,7 @@ static int add_file_items(struct btrfs_trans_handle *trans,
  ino_t parent_inum, struct stat *st,
  const char *path_name, int out_fd)
 {
-   int ret;
+   int ret = -1;
ssize_t ret_read;
u64 bytes_read = 0;
char *buffer = NULL;
@@ -889,7 +889,7 @@ static int traverse_directory(struct btrfs_trans_handle 
*trans,
ret = btrfs_lookup_inode(trans, root, &path, &root_dir_key, 1);
if (ret) {
fprintf(stderr, "root dir lookup error\n");
-   goto fail;
+   return -1;
}
 
leaf = path.nodes[0];
@@ -913,7 +913,7 @@ static int traverse_directory(struct btrfs_trans_handle 
*trans,
if (chdir(parent_dir_entry->path)) {
fprintf(stderr, "chdir error for %s\n",
parent_dir_name);
-   goto fail;
+   goto fail_no_files;
}
 
count = scandir(parent_dir_entry->path, &files,
@@ -996,6 +996,7 @@ static int traverse_directory(struct btrfs_trans_handle 
*trans,
return 0;
 fail:
free_namelist(files, count);
+fail_no_files:
free(parent_dir_entry->path);
free(parent_dir_entry);
return -1;
-- 
1.7.2.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 6/8] btrfs: Balance filter for virtual address ranges

2011-06-26 Thread Hugo Mills
Allow the balancing of chunks where some part of the chunk lies within
the virtual (i.e. btrfs-internal) address range passed.

Signed-off-by: Hugo Mills h...@carfax.org.uk
---
 fs/btrfs/ioctl.h   |9 +++--
 fs/btrfs/volumes.c |6 ++
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 21b0e6a..ba09b19 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -203,7 +203,8 @@ struct btrfs_ioctl_balance_progress {
 
 #define BTRFS_BALANCE_FILTER_CHUNK_TYPE (1 << 1)
 #define BTRFS_BALANCE_FILTER_DEVID (1 << 2)
-#define BTRFS_BALANCE_FILTER_MASK ((1 << 3) - 1) /* Logical or of all filter
+#define BTRFS_BALANCE_FILTER_VIRTUAL_ADDRESS_RANGE (1 << 3)
+#define BTRFS_BALANCE_FILTER_MASK ((1 << 4) - 1) /* Logical or of all filter
   * flags -- effectively versions
   * the filtered balance ioctl */
 
@@ -223,7 +224,11 @@ struct btrfs_ioctl_balance_start {
/* For FILTER_DEVID */
__u64 devid;
 
-   __u64 spare[506]; /* Make up the size of the structure to 4088
+   /* For FILTER_VIRTUAL_ADDRESS_RANGE */
+   __u64 vrange_start;
+   __u64 vrange_end;
+
+   __u64 spare[504]; /* Make up the size of the structure to 4088
   * bytes for future expansion */
 };
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 36d9018..828aa34 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2054,6 +2054,12 @@ int balance_chunk_filter(struct 
btrfs_ioctl_balance_start *filter,
if (!res)
return 0;
}
+   if (filter->flags & BTRFS_BALANCE_FILTER_VIRTUAL_ADDRESS_RANGE) {
+   u64 start = key->offset;
+   u64 end = start + btrfs_chunk_length(eb, chunk);
+   if (filter->vrange_start >= end || start >= filter->vrange_end)
+   return 0;
+   }
 
return 1;
 }
-- 
1.7.2.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 2/8] btrfs: Cancel filesystem balance

2011-06-26 Thread Hugo Mills
This patch adds an ioctl for cancelling a btrfs balance operation
mid-flight. The ioctl simply sets a flag, and the operation terminates
after the current block group move has completed.

Signed-off-by: Hugo Mills h...@carfax.org.uk
---
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/ioctl.c   |   28 
 fs/btrfs/ioctl.h   |1 +
 fs/btrfs/volumes.c |7 ++-
 4 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 25aa3cf..5031085 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -876,6 +876,7 @@ struct btrfs_block_group_cache {
 struct btrfs_balance_info {
u32 expected;
u32 completed;
+   int cancel_pending;
 };
 
 struct reloc_control;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 5ddf816..d4458d0 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2868,6 +2868,32 @@ error:
return ret;
 }
 
+/*
+ * Cancel a running balance operation
+ */
+long btrfs_ioctl_balance_cancel(struct btrfs_fs_info *fs_info)
+{
+   int err = 0;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   spin_lock(&fs_info->balance_info_lock);
+   if (!fs_info->balance_info) {
+   err = -EINVAL;
+   goto error;
+   }
+   if (fs_info->balance_info->cancel_pending) {
+   err = -ECANCELED;
+   goto error;
+   }
+   fs_info->balance_info->cancel_pending = 1;
+
+error:
+   spin_unlock(&fs_info->balance_info_lock);
+   return err;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
 {
@@ -2915,6 +2941,8 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_balance(root->fs_info->dev_root);
case BTRFS_IOC_BALANCE_PROGRESS:
return btrfs_ioctl_balance_progress(root->fs_info, argp);
+   case BTRFS_IOC_BALANCE_CANCEL:
+   return btrfs_ioctl_balance_cancel(root->fs_info);
case BTRFS_IOC_CLONE:
return btrfs_ioctl_clone(file, arg, 0, 0, 0);
case BTRFS_IOC_CLONE_RANGE:
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 575b25f..edcbe61 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -255,4 +255,5 @@ struct btrfs_ioctl_balance_progress {
   struct btrfs_ioctl_fs_info_args)
 #define BTRFS_IOC_BALANCE_PROGRESS _IOR(BTRFS_IOCTL_MAGIC, 32, \
  struct btrfs_ioctl_balance_progress)
+#define BTRFS_IOC_BALANCE_CANCEL _IO(BTRFS_IOCTL_MAGIC, 33)
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4c0a386..f38b231 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2049,6 +2049,7 @@ int btrfs_balance(struct btrfs_root *dev_root)
bal_info->expected = -1; /* One less than actually counted,
because chunk 0 is special */
bal_info->completed = 0;
+   bal_info->cancel_pending = 0;
spin_unlock(&dev_root->fs_info->balance_info_lock);
 
/* step one make some room on all the devices */
@@ -2109,7 +2110,7 @@ int btrfs_balance(struct btrfs_root *dev_root)
key.offset = (u64)-1;
key.type = BTRFS_CHUNK_ITEM_KEY;
 
-   while (1) {
+   while (!bal_info->cancel_pending) {
ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
if (ret < 0)
goto error;
@@ -2149,6 +2150,10 @@ int btrfs_balance(struct btrfs_root *dev_root)
   bal_info->completed, bal_info->expected);
}
ret = 0;
+   if (bal_info->cancel_pending) {
+   printk(KERN_INFO "btrfs: balance cancelled\n");
+   ret = -EINTR;
+   }
 error:
btrfs_free_path(path);
spin_lock(&dev_root->fs_info->balance_info_lock);
-- 
1.7.2.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 5/8] btrfs: Balance filter for device ID

2011-06-26 Thread Hugo Mills
Balance filter to take only chunks which have (or had) a stripe on the
given device. Useful if a device has been forcibly removed from the
filesystem, and the data from that device needs rebuilding.

Signed-off-by: Hugo Mills h...@carfax.org.uk
---
 fs/btrfs/ioctl.h   |8 ++--
 fs/btrfs/volumes.c |   14 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 124296e..21b0e6a 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -202,7 +202,8 @@ struct btrfs_ioctl_balance_progress {
 #define BTRFS_BALANCE_FILTER_COUNT_ONLY (1 << 0)
 
 #define BTRFS_BALANCE_FILTER_CHUNK_TYPE (1 << 1)
-#define BTRFS_BALANCE_FILTER_MASK ((1 << 2) - 1) /* Logical or of all filter
+#define BTRFS_BALANCE_FILTER_DEVID (1 << 2)
+#define BTRFS_BALANCE_FILTER_MASK ((1 << 3) - 1) /* Logical or of all filter
   * flags -- effectively versions
   * the filtered balance ioctl */
 
@@ -219,7 +220,10 @@ struct btrfs_ioctl_balance_start {
__u64 chunk_type;  /* Flag bits required */
__u64 chunk_type_mask; /* Mask of bits to examine */
 
-   __u64 spare[507]; /* Make up the size of the structure to 4088
+   /* For FILTER_DEVID */
+   __u64 devid;
+
+   __u64 spare[506]; /* Make up the size of the structure to 4088
   * bytes for future expansion */
 };
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ea466ab..36d9018 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2021,6 +2021,7 @@ int balance_chunk_filter(struct btrfs_ioctl_balance_start 
*filter,
 {
struct extent_buffer *eb;
struct btrfs_chunk *chunk;
+   int i;
 
/* No filter defined, everything matches */
if (!filter)
@@ -2040,6 +2041,19 @@ int balance_chunk_filter(struct 
btrfs_ioctl_balance_start *filter,
return 0;
}
}
+   if (filter->flags & BTRFS_BALANCE_FILTER_DEVID) {
+   int num_stripes = btrfs_chunk_num_stripes(eb, chunk);
+   int res = 0;
+   for (i = 0; i < num_stripes; i++) {
+   struct btrfs_stripe *stripe = btrfs_stripe_nr(chunk, i);
+   if (btrfs_stripe_devid(eb, stripe) == filter->devid) {
+   res = 1;
+   break;
+   }
+   }
+   if (!res)
+   return 0;
+   }
 
return 1;
 }
-- 
1.7.2.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 1/8] btrfs: Balance progress monitoring

2011-06-26 Thread Hugo Mills
This patch introduces a basic form of progress monitoring for balance
operations, by counting the number of block groups remaining. The
information is exposed to userspace by an ioctl.
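
For illustration only, a small userspace sketch of how the new ioctl could be
polled. It is not part of the patch; it assumes the ioctl.h definitions added
below and a file descriptor open anywhere on the filesystem:

/*
 * Illustrative only: poll the balance progress ioctl added by this
 * patch.  Requires the btrfs ioctl.h definitions from this series.
 */
#include <errno.h>
#include <stdio.h>
#include <sys/ioctl.h>

static void print_balance_progress(int fs_fd)
{
	struct btrfs_ioctl_balance_progress prog;

	if (ioctl(fs_fd, BTRFS_IOC_BALANCE_PROGRESS, &prog) < 0) {
		if (errno == EINVAL)
			printf("no balance in progress\n");
		else
			perror("BTRFS_IOC_BALANCE_PROGRESS");
		return;
	}
	printf("balanced %u of %u block groups\n",
	       prog.completed, prog.expected);
}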

Signed-off-by: Hugo Mills h...@carfax.org.uk
---
 fs/btrfs/ctree.h   |9 
 fs/btrfs/disk-io.c |2 +
 fs/btrfs/ioctl.c   |   34 +++
 fs/btrfs/ioctl.h   |7 ++
 fs/btrfs/volumes.c |   56 ++-
 5 files changed, 106 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3006287..25aa3cf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -873,6 +873,11 @@ struct btrfs_block_group_cache {
struct list_head cluster_list;
 };
 
+struct btrfs_balance_info {
+   u32 expected;
+   u32 completed;
+};
+
 struct reloc_control;
 struct btrfs_device;
 struct btrfs_fs_devices;
@@ -1115,6 +1120,10 @@ struct btrfs_fs_info {
u64 fs_state;
 
struct btrfs_delayed_root *delayed_root;
+
+   /* Keep track of any rebalance operations on this FS */
+   spinlock_t balance_info_lock;
+   struct btrfs_balance_info *balance_info;
 };
 
 /*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1ac8db5d..38f8fbc 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1619,6 +1619,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
spin_lock_init(&fs_info->fs_roots_radix_lock);
spin_lock_init(&fs_info->delayed_iput_lock);
spin_lock_init(&fs_info->defrag_inodes_lock);
+   spin_lock_init(&fs_info->balance_info_lock);
mutex_init(&fs_info->reloc_mutex);
 
init_completion(&fs_info->kobj_unregister);
@@ -1648,6 +1649,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
fs_info->metadata_ratio = 0;
fs_info->defrag_inodes = RB_ROOT;
fs_info->trans_no_join = 0;
+   fs_info->balance_info = NULL;
 
fs_info->thread_pool_size = min_t(unsigned long,
  num_online_cpus() + 2, 8);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a3c4751..5ddf816 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2836,6 +2836,38 @@ static long btrfs_ioctl_scrub_progress(struct btrfs_root 
*root,
return ret;
 }
 
+/*
+ * Return the current status of any balance operation
+ */
+long btrfs_ioctl_balance_progress(
+   struct btrfs_fs_info *fs_info,
+   struct btrfs_ioctl_balance_progress __user *user_dest)
+{
+   int ret = 0;
+   struct btrfs_ioctl_balance_progress dest;
+
+   spin_lock(&fs_info->balance_info_lock);
+   if (!fs_info->balance_info) {
+   ret = -EINVAL;
+   goto error;
+   }
+
+   dest.expected = fs_info->balance_info->expected;
+   dest.completed = fs_info->balance_info->completed;
+
+   spin_unlock(&fs_info->balance_info_lock);
+
+   if (copy_to_user(user_dest, &dest,
+sizeof(struct btrfs_ioctl_balance_progress)))
+   return -EFAULT;
+
+   return 0;
+
+error:
+   spin_unlock(&fs_info->balance_info_lock);
+   return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
 {
@@ -2881,6 +2913,8 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_dev_info(root, argp);
case BTRFS_IOC_BALANCE:
return btrfs_balance(root->fs_info->dev_root);
+   case BTRFS_IOC_BALANCE_PROGRESS:
+   return btrfs_ioctl_balance_progress(root->fs_info, argp);
case BTRFS_IOC_CLONE:
return btrfs_ioctl_clone(file, arg, 0, 0, 0);
case BTRFS_IOC_CLONE_RANGE:
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index ad1ea78..575b25f 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -193,6 +193,11 @@ struct btrfs_ioctl_space_args {
struct btrfs_ioctl_space_info spaces[0];
 };
 
+struct btrfs_ioctl_balance_progress {
+   __u32 expected;
+   __u32 completed;
+};
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -248,4 +253,6 @@ struct btrfs_ioctl_space_args {
 struct btrfs_ioctl_dev_info_args)
 #define BTRFS_IOC_FS_INFO _IOR(BTRFS_IOCTL_MAGIC, 31, \
   struct btrfs_ioctl_fs_info_args)
+#define BTRFS_IOC_BALANCE_PROGRESS _IOR(BTRFS_IOCTL_MAGIC, 32, \
+ struct btrfs_ioctl_balance_progress)
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1efa56e..4c0a386 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2026,6 +2026,7 @@ int btrfs_balance(struct btrfs_root *dev_root)
struct btrfs_root *chunk_root = dev_root->fs_info->chunk_root;
struct btrfs_trans_handle *trans;
struct btrfs_key found_key;
+   struct btrfs_balance_info *bal_info;
 
if (dev_root-fs_info

[PATCH v8 0/8] Balance management patches, v8

2011-06-26 Thread Hugo Mills
   v7 was a real mess -- I made any number of errors in rebasing it
onto the 3.0 tree -- so I've skipped direct to v8. Thanks to David
Sterba for pointing out all the problems.

   Changes since v6: rebased to 3.0-rc4.

   This series can also be pulled from the balance-management-v8
branch of http://git.darksatanic.net/repo/btrfs-kernel.git/

   Hugo.

Hugo Mills (8):
  btrfs: Balance progress monitoring
  btrfs: Cancel filesystem balance
  btrfs: Factor out enumeration of chunks to a separate function
  btrfs: Implement filtered balance ioctl
  btrfs: Balance filter for device ID
  btrfs: Balance filter for virtual address ranges
  btrfs: Replication-type information
  btrfs: Balance filter for physical device address

 fs/btrfs/ctree.h   |   10 ++
 fs/btrfs/disk-io.c |2 +
 fs/btrfs/ioctl.c   |  104 +-
 fs/btrfs/ioctl.h   |   49 ++
 fs/btrfs/super.c   |   16 +--
 fs/btrfs/volumes.c |  414 
 fs/btrfs/volumes.h |   23 +++-
 7 files changed, 482 insertions(+), 136 deletions(-)

-- 
1.7.2.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 8/8] btrfs: Balance filter for physical device address

2011-06-26 Thread Hugo Mills
Add a filter for balancing which allows the selection of chunks with
data in the given byte range on any block device in the filesystem. On
its own, this filter is of little use, but when used with the devid
filter, it can be used to rebalance all chunks which lie on a part of
a specific device.

Signed-off-by: Hugo Mills h...@carfax.org.uk
---
 fs/btrfs/ioctl.h   |9 +++--
 fs/btrfs/volumes.c |   22 ++
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index ba09b19..08fcfed 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -204,7 +204,8 @@ struct btrfs_ioctl_balance_progress {
 #define BTRFS_BALANCE_FILTER_CHUNK_TYPE (1 << 1)
 #define BTRFS_BALANCE_FILTER_DEVID (1 << 2)
 #define BTRFS_BALANCE_FILTER_VIRTUAL_ADDRESS_RANGE (1 << 3)
-#define BTRFS_BALANCE_FILTER_MASK ((1 << 4) - 1) /* Logical or of all filter
+#define BTRFS_BALANCE_FILTER_DEVICE_ADDRESS_RANGE (1 << 4)
+#define BTRFS_BALANCE_FILTER_MASK ((1 << 5) - 1) /* Logical or of all filter
   * flags -- effectively versions
   * the filtered balance ioctl */
 
@@ -228,7 +229,11 @@ struct btrfs_ioctl_balance_start {
__u64 vrange_start;
__u64 vrange_end;
 
-   __u64 spare[504]; /* Make up the size of the structure to 4088
+   /* For FILTER_DEVICE_ADDRESS_RANGE */
+   __u64 drange_start;
+   __u64 drange_end;
+
+   __u64 spare[502]; /* Make up the size of the structure to 4088
   * bytes for future expansion */
 };
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fb11550..fa536e9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2067,6 +2067,7 @@ int balance_chunk_filter(struct btrfs_ioctl_balance_start 
*filter,
struct extent_buffer *eb;
struct btrfs_chunk *chunk;
int i;
+   struct btrfs_replication_info replinfo;
 
/* No filter defined, everything matches */
if (!filter)
@@ -2080,6 +2081,8 @@ int balance_chunk_filter(struct btrfs_ioctl_balance_start 
*filter,
chunk = btrfs_item_ptr(eb, path->slots[0],
   struct btrfs_chunk);
 
+   btrfs_get_replication_info(&replinfo, btrfs_chunk_type(eb, chunk));
+
if (filter->flags & BTRFS_BALANCE_FILTER_CHUNK_TYPE) {
if ((btrfs_chunk_type(eb, chunk) & filter->chunk_type_mask)
!= filter->chunk_type) {
@@ -2105,6 +2108,25 @@ int balance_chunk_filter(struct 
btrfs_ioctl_balance_start *filter,
if (filter->vrange_start >= end || start >= filter->vrange_end)
return 0;
}
+   if (filter->flags & BTRFS_BALANCE_FILTER_DEVICE_ADDRESS_RANGE) {
+   int num_stripes = btrfs_chunk_num_stripes(eb, chunk);
+   int stripe_length = btrfs_chunk_length(eb, chunk)
+   * num_stripes / replinfo.num_copies;
+   int res = 0;
+
+   for (i = 0; i < num_stripes; i++) {
+   struct btrfs_stripe *stripe = btrfs_stripe_nr(chunk, i);
+   u64 start = btrfs_stripe_offset(eb, stripe);
+   u64 end = start + stripe_length;
+   if (filter->drange_start < end
+       && start < filter->drange_end) {
+   res = 1;
+   break;
+   }
+   }
+   if (!res)
+   return 0;
+   }
 
return 1;
 }
-- 
1.7.2.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 3/8] btrfs: Factor out enumeration of chunks to a separate function

2011-06-26 Thread Hugo Mills
The main balance function has two loops which are functionally
identical in their looping mechanism, but which perform a different
operation on the chunks they loop over. To avoid repeating code more
than necessary, factor this loop out into a separate iterator function
which takes a function parameter for the action to be performed.

Signed-off-by: Hugo Mills h...@carfax.org.uk
---
 fs/btrfs/volumes.c |  174 +--
 1 files changed, 99 insertions(+), 75 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f38b231..a81fd3c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2014,6 +2014,97 @@ static u64 div_factor(u64 num, int factor)
return num;
 }
 
+/* Define a type, and two functions which can be used for the two
+ * phases of the balance operation: one for counting chunks, and one
+ * for actually moving them. */
+typedef void (*balance_iterator_function)(struct btrfs_root *,
+ struct btrfs_balance_info *,
+ struct btrfs_path *,
+ struct btrfs_key *);
+
+static void balance_count_chunks(struct btrfs_root *chunk_root,
+ struct btrfs_balance_info *bal_info,
+ struct btrfs_path *path,
+ struct btrfs_key *key)
+{
+   spin_lock(chunk_root-fs_info-balance_info_lock);
+   bal_info-expected++;
+   spin_unlock(chunk_root-fs_info-balance_info_lock);
+}
+
+static void balance_move_chunks(struct btrfs_root *chunk_root,
+struct btrfs_balance_info *bal_info,
+struct btrfs_path *path,
+struct btrfs_key *key)
+{
+   int ret;
+
+   ret = btrfs_relocate_chunk(chunk_root,
+  chunk_root-root_key.objectid,
+  key-objectid,
+  key-offset);
+   BUG_ON(ret  ret != -ENOSPC);
+   spin_lock(chunk_root-fs_info-balance_info_lock);
+   bal_info-completed++;
+   spin_unlock(chunk_root-fs_info-balance_info_lock);
+   printk(KERN_INFO btrfs: balance: %u/%u block groups completed\n,
+  bal_info-completed, bal_info-expected);
+}
+
+/* Iterate through all chunks, performing some function on each one. */
+static int balance_iterate_chunks(struct btrfs_root *chunk_root,
+  struct btrfs_balance_info *bal_info,
+  balance_iterator_function iterator_fn)
+{
+   int ret = 0;
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   struct btrfs_key found_key;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+   key.offset = (u64)-1;
+   key.type = BTRFS_CHUNK_ITEM_KEY;
+
+   while (!bal_info-cancel_pending) {
+   ret = btrfs_search_slot(NULL, chunk_root, key, path, 0, 0);
+   if (ret  0)
+   break;
+   /*
+* this shouldn't happen, it means the last relocate
+* failed
+*/
+   if (ret == 0)
+   break;
+
+   ret = btrfs_previous_item(chunk_root, path, 0,
+ BTRFS_CHUNK_ITEM_KEY);
+   if (ret)
+   break;
+
+   btrfs_item_key_to_cpu(path-nodes[0], found_key,
+ path-slots[0]);
+   if (found_key.objectid != key.objectid)
+   break;
+
+   /* chunk zero is special */
+   if (found_key.offset == 0)
+   break;
+
+   /* Call the function to do the work for this chunk */
+   btrfs_release_path(path);
+   iterator_fn(chunk_root, bal_info, path, found_key);
+
+   key.offset = found_key.offset - 1;
+   }
+
+   btrfs_free_path(path);
+   return ret;
+}
+
 int btrfs_balance(struct btrfs_root *dev_root)
 {
int ret;
@@ -2021,11 +2112,8 @@ int btrfs_balance(struct btrfs_root *dev_root)
struct btrfs_device *device;
u64 old_size;
u64 size_to_free;
-   struct btrfs_path *path;
-   struct btrfs_key key;
struct btrfs_root *chunk_root = dev_root-fs_info-chunk_root;
struct btrfs_trans_handle *trans;
-   struct btrfs_key found_key;
struct btrfs_balance_info *bal_info;
 
if (dev_root-fs_info-sb-s_flags  MS_RDONLY)
@@ -2046,8 +2134,7 @@ int btrfs_balance(struct btrfs_root *dev_root)
}
spin_lock(dev_root-fs_info-balance_info_lock);
dev_root-fs_info-balance_info = bal_info;
-   bal_info-expected = -1; /* One less than actually counted,
-   because chunk 0 is special */
+   bal_info

[PATCH v8 7/8] btrfs: Replication-type information

2011-06-26 Thread Hugo Mills
There are a few places in btrfs where knowledge of the various
parameters of a replication type is needed. Factor this out into a
single function which can supply all the relevant information.

Signed-off-by: Hugo Mills h...@carfax.org.uk
---
 fs/btrfs/super.c   |   16 ++---
 fs/btrfs/volumes.c |  155 +---
 fs/btrfs/volumes.h |   17 ++
 3 files changed, 98 insertions(+), 90 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 0bb4ebb..2ea4e01 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -965,12 +965,12 @@ static int btrfs_calc_avail_data_space(struct btrfs_root 
*root, u64 *free_bytes)
struct btrfs_device_info *devices_info;
struct btrfs_fs_devices *fs_devices = fs_info-fs_devices;
struct btrfs_device *device;
+   struct btrfs_replication_info repl_info;
u64 skip_space;
u64 type;
u64 avail_space;
u64 used_space;
u64 min_stripe_size;
-   int min_stripes = 1;
int i = 0, nr_devices;
int ret;
 
@@ -984,12 +984,7 @@ static int btrfs_calc_avail_data_space(struct btrfs_root 
*root, u64 *free_bytes)
 
/* calc min stripe number for data space alloction */
type = btrfs_get_alloc_profile(root, 1);
-   if (type & BTRFS_BLOCK_GROUP_RAID0)
-   min_stripes = 2;
-   else if (type & BTRFS_BLOCK_GROUP_RAID1)
-   min_stripes = 2;
-   else if (type & BTRFS_BLOCK_GROUP_RAID10)
-   min_stripes = 4;
+   btrfs_get_replication_info(&repl_info, type);
 
if (type & BTRFS_BLOCK_GROUP_DUP)
min_stripe_size = 2 * BTRFS_STRIPE_LEN;
@@ -1057,14 +1052,15 @@ static int btrfs_calc_avail_data_space(struct 
btrfs_root *root, u64 *free_bytes)
 
i = nr_devices - 1;
avail_space = 0;
-   while (nr_devices >= min_stripes) {
+   while (nr_devices >= repl_info.devs_min) {
if (devices_info[i].max_avail >= min_stripe_size) {
int j;
u64 alloc_size;
 
-   avail_space += devices_info[i].max_avail * min_stripes;
+   avail_space += devices_info[i].max_avail
+   * repl_info.devs_min;
alloc_size = devices_info[i].max_avail;
-   for (j = i + 1 - min_stripes; j <= i; j++)
+   for (j = i + 1 - repl_info.devs_min; j <= i; j++)
devices_info[j].max_avail -= alloc_size;
}
i--;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 828aa34..fb11550 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -117,6 +117,52 @@ static void requeue_list(struct btrfs_pending_bios 
*pending_bios,
pending_bios->tail = tail;
 }
 
+void btrfs_get_replication_info(struct btrfs_replication_info *info,
+   u64 type)
+{
+   info->sub_stripes = 1;
+   info->dev_stripes = 1;
+   info->devs_increment = 1;
+   info->num_copies = 1;
+   info->devs_max = 0; /* 0 == as many as possible */
+   info->devs_min = 1;
+
+   if (type & BTRFS_BLOCK_GROUP_DUP) {
+   info->dev_stripes = 2;
+   info->num_copies = 2;
+   info->devs_max = 1;
+   } else if (type & BTRFS_BLOCK_GROUP_RAID0) {
+   info->devs_min = 2;
+   } else if (type & BTRFS_BLOCK_GROUP_RAID1) {
+   info->devs_increment = 2;
+   info->num_copies = 2;
+   info->devs_max = 2;
+   info->devs_min = 2;
+   } else if (type & BTRFS_BLOCK_GROUP_RAID10) {
+   info->sub_stripes = 2;
+   info->devs_increment = 2;
+   info->num_copies = 2;
+   info->devs_min = 4;
+   }
+
+   if (type & BTRFS_BLOCK_GROUP_DATA) {
+   info->max_stripe_size = 1024 * 1024 * 1024;
+   info->min_stripe_size = 64 * 1024 * 1024;
+   info->max_chunk_size = 10 * info->max_stripe_size;
+   } else if (type & BTRFS_BLOCK_GROUP_METADATA) {
+   info->max_stripe_size = 256 * 1024 * 1024;
+   info->min_stripe_size = 32 * 1024 * 1024;
+   info->max_chunk_size = info->max_stripe_size;
+   } else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
+   info->max_stripe_size = 8 * 1024 * 1024;
+   info->min_stripe_size = 1 * 1024 * 1024;
+   info->max_chunk_size = 2 * info->max_stripe_size;
+   } else {
+   printk(KERN_ERR "Block group is of an unknown usage type: not data, metadata or system.\n");
+   BUG_ON(1);
+   }
+}
+
 /*
  * we try to collect pending bios for a device so we don't get a large
  * number of procs sending bios down to the same device.  This greatly
@@ -1216,6 +1262,7 @@ int btrfs_rm_device(struct btrfs_root *root, char 
*device_path)
struct block_device *bdev

Integration branch updated

2011-06-26 Thread Hugo Mills
   I've just updated the btrfs-progs integration branch I've been
keeping. Not a huge amount new since last time:

Andreas Philipp (1):
  print parent ID in btrfs subvolume list

Goffredo Baroncelli (1):
  Scan the devices listed in /proc/partitions

Hugo Mills (1):
  mkfs.btrfs: Fix compilation errors with gcc 4.6

Zhong, Xin (1):
  btrfs-progs: Improvement for making btrfs image from source directory.

cwillu (1):
  Btrfs-progs: Correct path munging in bcp

   I've also re-worked my balance-management patches to deal with a
few oddities in the ordering of the help output, and the parameter
counts for the various balance commands.

   You can get the latest version from:

http://git.darksatanic.net/repo/btrfs-progs-unstable.git/ integration-20110626

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- I am the author. You are the audience. I outrank you! --- 


signature.asc
Description: Digital signature


Re: Integration branch updated

2011-06-27 Thread Hugo Mills
On Mon, Jun 27, 2011 at 03:03:30PM +0200, Andreas Philipp wrote:
 
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
  
 On 27.06.2011 14:43, David Sterba wrote:
  On Sun, Jun 26, 2011 at 10:10:22PM +0100, Hugo Mills wrote:
  I've just updated the btrfs-progs integration branch I've been
  keeping. Not a huge amount new since last time:
 
  Andreas Philipp (1):
  print parent ID in btrfs subvolume list
 
  dunno if this has been mentioned already, but this change breaks
  xfstests/254 and needs a patch once merged.
 Sorry, I was not aware of the problem with xfstests/254. But as far as

   I've not seen anything to that effect reported. I haven't been
running xfstests regularly, though...

 I see, xfstests/254 tests explicitly for subvolume/snapshot features
 in btrfs and uses a specific filter to parse the output of btrfs
 subvolume list. If this output changes (without introducing another
 error), then only the test is broken.
 Any suggestions on how to change the patch? Maybe adding a flag (-p ?)
 to add the parent ID in the output and leave the standard output
 untouched?

   That has the benefit of not breaking existing code that attempts to
parse it. Anything else is going to need xfstests/254 (and any other
users of the interface) to work out which version it's trying to parse
and dealing with it appropriately.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- If it's December 1941 in Casablanca,  what time is it ---  
  in New York?   


signature.asc
Description: Digital signature


Re: [PATCH v8 7/8] btrfs: Replication-type information

2011-06-28 Thread Hugo Mills
On Tue, Jun 28, 2011 at 06:32:43PM +0200, David Sterba wrote:
 On Sun, Jun 26, 2011 at 09:36:54PM +0100, Hugo Mills wrote:
  diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
  index 828aa34..fb11550 100644
  --- a/fs/btrfs/volumes.c
  +++ b/fs/btrfs/volumes.c
  @@ -117,6 +117,52 @@ static void requeue_list(struct btrfs_pending_bios 
  *pending_bios,
  pending_bios-tail = tail;
   }
   
  +void btrfs_get_replication_info(struct btrfs_replication_info *info,
  +   u64 type)
  +{
  +   info-sub_stripes = 1;
  +   info-dev_stripes = 1;
  +   info-devs_increment = 1;
  +   info-num_copies = 1;
  +   info-devs_max = 0; /* 0 == as many as possible */
  +   info-devs_min = 1;
  +
  +   if (type  BTRFS_BLOCK_GROUP_DUP) {
  +   info-dev_stripes = 2;
  +   info-num_copies = 2;
  +   info-devs_max = 1;
  +   } else if (type  BTRFS_BLOCK_GROUP_RAID0) {
  +   info-devs_min = 2;
  +   } else if (type  BTRFS_BLOCK_GROUP_RAID1) {
  +   info-devs_increment = 2;
  +   info-num_copies = 2;
  +   info-devs_max = 2;
  +   info-devs_min = 2;
  +   } else if (type  BTRFS_BLOCK_GROUP_RAID10) {
  +   info-sub_stripes = 2;
  +   info-devs_increment = 2;
  +   info-num_copies = 2;
  +   info-devs_min = 4;
  +   }
  +
  +   if (type  BTRFS_BLOCK_GROUP_DATA) {
  +   info-max_stripe_size = 1024 * 1024 * 1024;
  +   info-min_stripe_size = 64 * 1024 * 1024;
  +   info-max_chunk_size = 10 * info-max_stripe_size;
  +   } else if (type  BTRFS_BLOCK_GROUP_METADATA) {
  +   info-max_stripe_size = 256 * 1024 * 1024;
  +   info-min_stripe_size = 32 * 1024 * 1024;
  +   info-max_chunk_size = info-max_stripe_size;
  +   } else if (type  BTRFS_BLOCK_GROUP_SYSTEM) {
  +   info-max_stripe_size = 8 * 1024 * 1024;
  +   info-min_stripe_size = 1 * 1024 * 1024;
  +   info-max_chunk_size = 2 * info-max_stripe_size;
  +   } else {
  +   printk(KERN_ERR Block group is of an unknown usage type: not 
  data, metadata or system.\n);
  +   BUG_ON(1);

   From inspection, this looks like it's a viable solution:

+   info->max_stripe_size = 0;
+   info->min_stripe_size = -1ULL;
+   info->max_chunk_size = 0;

We only run into problems if a user of this function passes a
RAID-only block group type and then tries to use the size parameters
from it. There's only three users of the function currently, and this
case is the only one that doesn't pass a real block group type flag.

   I'll run a quick test of dev rm and see what happens...

 I'm hitting this BUG_ON with 'btrfs device delete', type = 24 which is
 BTRFS_BLOCK_GROUP_RAID0 + BTRFS_BLOCK_GROUP_RAID1 .
 
 in btrfs_rm_device:
 
 1277 all_avail = root->fs_info->avail_data_alloc_bits |
 1278 root->fs_info->avail_system_alloc_bits |
 1279 root->fs_info->avail_metadata_alloc_bits;
 
 the values before the call are:
 
 [  105.107074] D: all_avail 24
 [  105.111844] D: root->fs_info->avail_data_alloc_bits 8
 [  105.118858] D: root->fs_info->avail_system_alloc_bits 16
 [  105.126110] D: root->fs_info->avail_metadata_alloc_bits 16
 
 
 there are 5 devices, sdb5 - sdb9, i'm removing sdb9, after clean
 mount.
 
 
 david

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- vi vi vi:  the Editor of the Beast. ---   


signature.asc
Description: Digital signature


Re: subvolumes missing from btrfs subvolume list output

2011-06-29 Thread Hugo Mills
On Wed, Jun 29, 2011 at 12:16:06PM -0400, Josef Bacik wrote:
 On 06/29/2011 11:00 AM, Stephane Chazelas wrote:
  2011-06-29 15:37:47 +0100, Stephane Chazelas:
  [...]
  I found
  http://thread.gmane.org/gmane.comp.file-systems.btrfs/8123/focus=8208
 
  which looks like the same issue, with Li Zefan saying he had a
  fix, but I couldn't find any mention that it was actually fixed.
 
  Has anybody got any update on that?
  [...]
  
  I've found
  http://thread.gmane.org/gmane.comp.file-systems.btrfs/8232
  
  but no corresponding fix or ioctl.c
  http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=history;f=fs/btrfs/ioctl.c
  
  I'm under the impression that the issue has been forgotten
  about.
  
  From what I managed to gather though, it seems that what's on
  disk is correct, it's just the ioctl and/or btrfs sub list
  that's wrong. Am I right?
 
 Yeah, did you apply the patch from that thread and verify that it fixes
 your problem?  Thanks,

   Note that changing this API will probably break btrfs-gui's listing
of subvolumes...

   The issue with that patch is that there are two distinct behaviours
that people want or expect with the tree-search ioctl:

(A) Return all items with keys which collate linearly between
(min_objectid, min_type, min_offset) and 
(max_objectid, max_type, max_offset)

i.e. treating keys as indivisible objects and sorting lexically,
as the trees do.

(B) Return all items with keys (i, t, o) which fulfil the criteria
(min_objectid <= i <= max_objectid,
 min_type <= t <= max_type,
 min_offset <= o <= max_offset)

i.e. treating keys as 3-tuples, and selecting from a rectilinear
subset of the tuple space, which is natural for some
applications.
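
   To make the distinction concrete, here's a tiny sketch (the names are
mine, not anything from btrfs-progs) of the per-field test that behaviour
(B) implies -- it's what a caller of the current (A)-style ioctl has to
apply itself to every key that comes back if it actually wants (B):

/*
 * Illustrative helper: behaviour (B) treats a key as a 3-tuple and
 * requires every field to lie within its own [min, max] range.
 */
struct key_box {
	unsigned long long min_objectid, max_objectid;
	unsigned long long min_offset, max_offset;
	unsigned char min_type, max_type;
};

static int key_in_box(unsigned long long objectid, unsigned char type,
		      unsigned long long offset, const struct key_box *b)
{
	return objectid >= b->min_objectid && objectid <= b->max_objectid &&
	       type >= b->min_type && type <= b->max_type &&
	       offset >= b->min_offset && offset <= b->max_offset;
}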

   Clearly, we can't do both with the same call (except for some
limited cases (*)). However, different users expect different
behaviours. The current behaviour is (A), which is the natural
behaviour for tree searches within the btrfs code, and is (IMO) the
right thing to be doing for an API like this.

   It sounds to me like the user of the API needs to be fixed, not the
ioctl itself -- possibly the author of the subvol scanning code
assumed (B) when they were getting (A). Note that there is at least
one other user of the ioctl outside btrfs-progs: btrfs-gui, which uses
the ioctl for several things, one of which is enumerating subvolumes
as btrfs-progs does.

   It should be possible to write an additional ioctl for behaviour
(B) which contains both min and max limits on each element of the key
3-tuple, *and* the current search state. That would reduce developer
confusion (given appropriate comments or documentation to explain what
the difference between the two is). However, I'm not sufficiently
convinced that it's actually necessary right now. I may change my tune
after I've started doing some of the more complex bits I'd thought of
doing with btrfs-gui, but for now, it's perfectly possible to use the
existing API without too much hassle.

   Hugo.

(*) The limited cases where both behaviours return the same set of
keys are:

(i_0, 0, 0) to (i_1, -1UL, -1UL)
(i, t_0, 0) to (i, t_1, -1UL)
(i, t, o_0) to (i, t, o_1)

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I get nervous when I see words like 'mayhaps' in a novel, ---
because I fear that just round the corner
  is lurking 'forsooth'  


signature.asc
Description: Digital signature


Re: subvolumes missing from btrfs subvolume list output

2011-06-29 Thread Hugo Mills
On Wed, Jun 29, 2011 at 05:47:41PM +0100, Hugo Mills wrote:
 (*) The limited cases where both behaviours return the same set of
 keys are:
 
 (i_0, 0, 0) to (i_1, -1UL, -1UL)
I clearly meant  (i_1,  255, -1UL) here...

 (i, t_0, 0) to (i, t_1, -1UL)
 (i, t, o_0) to (i, t, o_1)

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I get nervous when I see words like 'mayhaps' in a novel, ---
because I fear that just round the corner
  is lurking 'forsooth'  


signature.asc
Description: Digital signature


Re: subvolumes missing from btrfs subvolume list output

2011-06-30 Thread Hugo Mills
On Thu, Jun 30, 2011 at 12:52:59PM +0200, Andreas Philipp wrote:
 
  
 On 30.06.2011 12:43, Stephane Chazelas wrote:
  2011-06-30 11:18:42 +0200, Andreas Philipp: [...]
  After that, I posted a patch to fix btrfs-progs, which Chris
  agreed on:
 
   http://marc.info/?l=linux-btrfs&m=129238454714319&w=2
  [...]
 
  Great. Thanks a lot
 
  It fixes my problem indeed.
 
  Which brings me to my next question: where to find the latest
  btrfs-progs if not at
 
 git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git
 
 
 [...]
  Hugo Mills keeps an integration branch with nearly all patches
  to btrfs-progs applied. See
 
  http://www.spinics.net/lists/linux-btrfs/msg10594.html
 
  and for the last update
 
  http://www.spinics.net/lists/linux-btrfs/msg10890.html
  [...]
 
  Thanks.
 
  It might be worth adding a link to that to
  https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories
 
  Note that it (integration-20110626) doesn't seem to include the fix
  in http://marc.info/?l=linux-btrfs&m=129238454714319&w=2 though.
 Hi Hugo,
 
 Can you please include that fix in the next release of your
 integration branch for btrfs-progs-unstable?

   Yes, will do.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Someone's been throwing dead sheep down my Fun Well ---   


signature.asc
Description: Digital signature


Re: subvolumes missing from btrfs subvolume list output

2011-06-30 Thread Hugo Mills
On Thu, Jun 30, 2011 at 11:43:40AM +0100, Stephane Chazelas wrote:
 2011-06-30 11:18:42 +0200, Andreas Philipp:
 [...]
   After that, I posted a patch to fix btrfs-progs, which Chris
   agreed on:
  
   http://marc.info/?l=linux-btrfs&m=129238454714319&w=2
   [...]
  
   Great. Thanks a lot
  
   It fixes my problem indeed.
  
   Which brings me to my next question: where to find the latest
   btrfs-progs if not at
   git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git
 [...]
  Hugo Mills keeps an integration branch with nearly all patches to
  btrfs-progs applied.
  See
  
  http://www.spinics.net/lists/linux-btrfs/msg10594.html
  
  and for the last update
  
  http://www.spinics.net/lists/linux-btrfs/msg10890.html
 [...]
 
 Thanks.
 
 It might be worth adding a link to that to
 https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories
 
 Note that it (integration-20110626) doesn't seem to include the fix in
 http://marc.info/?l=linux-btrfs&m=129238454714319&w=2 though.

   No, I didn't see it when I did my trawl through the mailing list
archives, because it wasn't marked as [PATCH]. I'll pull it in for the
next round of the integration tree, though.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Someone's been throwing dead sheep down my Fun Well ---   


signature.asc
Description: Digital signature


Re: [PATCH] [btrfs-progs integration] incorrect argument checking for btrfs sub snap -r

2011-06-30 Thread Hugo Mills
On Thu, Jun 30, 2011 at 01:34:38PM +0100, Stephane Chazelas wrote:
 Looks like this was missing in integration-20110626 for the
 readonly snapshot patch:
 
 diff --git a/btrfs.c b/btrfs.c
 index e117172..be6ece5 100644
 --- a/btrfs.c
 +++ b/btrfs.c
 @@ -49,7 +49,7 @@ static struct Command commands[] = {
   /*
   avoid short commands different for the case only
   */
 - { do_clone, 2,
 + { do_clone, -1,
  "subvolume snapshot", "[-r] <source> [<dest>/]<name>\n"
    "Create a writable/readonly snapshot of the subvolume <source> with\n"
    "the name <name> in the <dest> directory.",
 
 Without that, btrfs sub snap -r x y would fail as it's not *2*
 arguments.

   Thanks. Added to the queue.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- How deep will this sub go? Oh,  she'll go all the way to ---   
the bottom if we don't stop her.


signature.asc
Description: Digital signature


Re: [PATCH] [btrfs-progs integration] incorrect argument checking for btrfs sub snap -r

2011-06-30 Thread Hugo Mills
On Thu, Jun 30, 2011 at 10:55:15PM +0200, Andreas Philipp wrote:
 On 30.06.2011 14:34, Stephane Chazelas wrote:
  Looks like this was missing in integration-20110626 for the
  readonly snapshot patch:
 
  diff --git a/btrfs.c b/btrfs.c
  index e117172..be6ece5 100644
  --- a/btrfs.c
  +++ b/btrfs.c
  @@ -49,7 +49,7 @@ static struct Command commands[] = {
  /*
  avoid short commands different for the case only
  */
  - { do_clone, 2,
  + { do_clone, -1,
  "subvolume snapshot", "[-r] <source> [<dest>/]<name>\n"
  "Create a writable/readonly snapshot of the subvolume <source> with\n"
  "the name <name> in the <dest> directory.",
 
  Without that, btrfs sub snap -r x y would fail as it's not *2*
  arguments.
 Unfortunately, this is not correct either. -1 means that the minimum
 number of arguments is 1 and since we need at least source and
 name, this is 2. So the correct version should be -2.

   OK, I'll fix that here, as the patch is part of my pull request for
Chris. (I saw the [] around dest but missed that name was
mandatory... it's been a long day).
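
   (For reference, the convention at issue, sketched with illustrative
struct and function names rather than the exact btrfs.c code: a
non-negative nargs means "exactly this many arguments", a negative one
means "at least -nargs arguments", which is what an optional flag like
-r needs.)

struct Command {
	int (*func)(int argc, char **argv);
	int nargs;            /* >= 0: exact count; < 0: minimum of -nargs */
	const char *verb;
	const char *help;
};

static int nargs_ok(const struct Command *cmd, int argc)
{
	if (cmd->nargs >= 0)
		return argc == cmd->nargs;   /* exact count required */
	return argc >= -cmd->nargs;          /* at least -nargs required */
}

/* With nargs == -1, "btrfs sub snap x" (one argument) slips through even
 * though <name> is mandatory; -2 requires both source and name while
 * still allowing the optional -r as a third argument. */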

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Nothing wrong with being written in Perl... Some of my best ---   
  friends are written in Perl.   


signature.asc
Description: Digital signature


btrfs-progs: integration branch updated

2011-06-30 Thread Hugo Mills
   After a reorganisation of patches, and sending a bunch of them to
Chris, I've also updated the integration branch to match that. It's
available from:

http://git.darksatanic.net/repo/btrfs-progs-unstable.git/ integration-20110630

   The shortlog of 17 patches in this branch beyond the ones I've sent
to Chris is below.

   Hugo.


Andreas Philipp (1):
  print parent ID in btrfs subvolume list

Goffredo Baroncelli (1):
  Scan the devices listed in /proc/partitions

Hugo Mills (8):
  Balance progress monitoring.
  Add --monitor option to btrfs balance progress.
  User-space tool for cancelling balance operations.
  Run userspace tool in background for balances.
  Initial implementation of userspace interface for filtered balancing.
  Balance filter by device ID
  Balance filter for virtual address range
  Interface for device range balance filter

Jan Schmidt (5):
  commands added
  scrub ioctls
  added check_mounted_where
  scrub userland implementation
  scrub added to manpage

WuBo (1):
  Btrfs-progs: Add chunk tree recover tool

Zhong, Xin (1):
  btrfs-progs: Improvement for making btrfs image from source directory.


-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Nothing wrong with being written in Perl... Some of my best ---   
  friends are written in Perl.   


signature.asc
Description: Digital signature

