Re: [PATCH] btrfs: copy fsid to super_block s_uuid
Hi Darrick, thanks for commenting.

>> + memcpy(sb->s_uuid, fs_info->fsid, BTRFS_FSID_SIZE);
>
> uuid_copy()?

That requires a larger migration to uuid_t; IMO it can be done all together in a separate patch. As an experiment, starting with struct btrfs_fs_info.fsid to check its footprint, I just renamed fsid to fs_id and compiled. It reports 73 "has no member named 'fsid'" errors. So it looks like redefining u8 fsid[] to uuid_t fsid, and then updating all of its users, needs to be simplified first. Any suggestions?

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: verify_dir_item fails in replay_xattr_deletes
From: Su Yue

In replay_xattr_deletes(), the argument @slot of verify_dir_item() should be the loop variable @i instead of path->slots[0]. The bug causes failures of generic/066 and shared/002 in xfstests.

dmesg:
[12507.810781] BTRFS critical (device dm-0): invalid dir item name len: 10
[12507.811185] BTRFS: error (device dm-0) in btrfs_replay_log:2475: errno=-5 IO failure (Failed to recover log tree)
[12507.811928] BTRFS error (device dm-0): cleaner transaction attach returned -30
[12507.821020] BTRFS error (device dm-0): open_ctree failed
[12508.131526] BTRFS info (device dm-0): disk space caching is enabled
[12508.132145] BTRFS info (device dm-0): has skinny extents
[12508.136265] BTRFS critical (device dm-0): invalid dir item name len: 10
[12508.136678] BTRFS: error (device dm-0) in btrfs_replay_log:2475: errno=-5 IO failure (Failed to recover log tree)
[12508.137501] BTRFS error (device dm-0): cleaner transaction attach returned -30
[12508.147982] BTRFS error (device dm-0): open_ctree failed

Signed-off-by: Su Yue
---
 fs/btrfs/tree-log.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index f20ef211a73d..3a11ae63676e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2153,8 +2153,7 @@ static int replay_xattr_deletes(struct btrfs_trans_handle *trans,
 		u32 this_len = sizeof(*di) + name_len + data_len;
 		char *name;

-		ret = verify_dir_item(fs_info, path->nodes[0],
-				      path->slots[0], di);
+		ret = verify_dir_item(fs_info, path->nodes[0], i, di);
 		if (ret) {
 			ret = -EIO;
 			goto out;
--
2.13.3
Re: Massive loss of disk space
Austin S. Hemmelgarn posted on Tue, 01 Aug 2017 10:47:30 -0400 as excerpted:

> I think I _might_ understand what's going on here. Is that test program
> calling fallocate using the desired total size of the file, or just
> trying to allocate the range beyond the end to extend the file? I've
> seen issues with the first case on BTRFS before, and I'm starting to
> think that it might actually be trying to allocate the exact amount of
> space requested by fallocate, even if part of the range is already
> allocated space.

If I've interpreted previous discussions on this list correctly (not being a dev, only a btrfs user, sysadmin, and list regular)... That's exactly what it's doing, and it's _intended_ behavior.

The reasoning is something like this: fallocate is supposed to pre-allocate some space, with the intent that writes into that space won't fail, because the space is already allocated. For an existing file with some data already in it, ext4 and xfs do that counting the existing space. But btrfs is copy-on-write, meaning it's going to have to write the new data to a different location than the existing data, and it may well not free up the existing allocation (if even a single 4 KiB block of the existing allocation remains unwritten, it will hold down the entire previous allocation, which isn't released until *none* of it is still in use -- in normal usage, "in use" can also be due to old snapshots or other reflinks to the same extent, though in these test cases it's not). So in order to provide the writes-to-preallocated-space-shouldn't-ENOSPC guarantee, btrfs can't count currently used space as part of the fallocate.
The different behavior is entirely due to btrfs being COW, and thus a choice having to be made: do we worst-case reserve at fallocate time for writes over currently used data that will have to be COWed elsewhere (possibly without freeing the existing extents, because something still references them), or do we risk ENOSPCing on a write to a previously fallocated area? The choice was to worst-case-reserve and take the ENOSPC risk at fallocate time, so the write into that fallocated space can then proceed without the ENOSPC risk that COW would otherwise imply. Make sense, or is my understanding a horrible misunderstanding? =:^)

So if you're actually only appending, fallocate the /additional/ space, not the /entire/ space, and you'll get what you need. But if you're potentially overwriting what's there already, better fallocate the entire space, which triggers the btrfs worst-case allocation behavior you see, in order to guarantee it won't ENOSPC during the actual write.

Of course the only time the behavior actually differs is with COW, but then there's a BIG difference -- and that BIG difference has a GOOD BIG reason! =:^) That difference will certainly necessitate some relearning of the /correct/ way to do it, for devs who were doing it the COW-worst-case way all along even when they didn't need to, because it happened not to make a difference on what they were testing on, which happened not to be COW...

Reminds me of the way newer versions of gcc, or trying to build with clang as well, tend to trigger relearning: newer versions are stricter in order to allow better optimization, and other implementations are simply different in what they're strict about, /because/ they're a different implementation. Well, btrfs is stricter... because it's a different implementation that /has/ to be stricter... due to COW.

-- Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: BTRFS error: bad tree block start 0 623771648
Roman Mamedov posted on Tue, 01 Aug 2017 11:08:05 +0500 as excerpted:

> On Sun, 30 Jul 2017 18:14:35 +0200 "marcel.cochem" wrote:
>
>> I am pretty sure that not all data is lost, as I can grep through the
>> 100 GB SSD partition. But my question is whether there is a tool to rescue
>> all (intact) data and maybe have only a few corrupt files which can't
>> be recovered.
>
> There is such a tool, see
> https://btrfs.wiki.kernel.org/index.php/Restore

I was going to suggest that too... and even started a reply to do so... upon which I read a bit closer and saw he'd actually tried restore already...

And before you suggest it, he tried btrfs-find-root as well, and it didn't work either, so he can't do the advanced/technical mode of restore, feeding it addresses from btrfs-find-root, either. =:^( It's in the post...

So unfortunately he's pretty much left with manual hacking and scraping, and that's at a level beyond what I at least am able to help him with...

-- Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Btrfs + compression = slow performance and high cpu usage
[ ... ]

> This is the "storage for beginners" version; what happens in
> practice however depends a lot on the specific workload profile
> (typical read/write sizes, latencies and rates) and on the caching and
> queueing algorithms in both Linux and the HA firmware.

To add a bit of slightly more advanced discussion: the main reason for larger strips ("chunk size") is to avoid the huge latencies of disk rotation when using unsynchronized disk drives, as detailed here:

http://www.sabi.co.uk/blog/12-thr.html?120310#120310

That relates only weakly to Btrfs.
Re: BTRFS error: bad tree block start 0 623771648
On Tue, Aug 01, 2017 at 11:04:10AM +0500, Roman Mamedov wrote:
> On Mon, 31 Jul 2017 11:12:01 -0700 Liu Bo wrote:
>
>> Superblock and chunk tree root are OK; it looks like the header part of
>> the tree root is now all-zero, but I'm unable to think of a btrfs bug
>> which can lead to that (if there is one, it is a serious one)
>
> I see that the FS is being mounted with "discard". So maybe it was a TRIM gone
> bad (wrong location, or in a wrong sequence).

Having checked the discard path in btrfs, it looks OK to me; more likely this was caused by problems in the underlying stack.

Thanks,

-liubo

> Generally it appears to be not recommended to use "discard" by now (because of
> its performance impact, and maybe possible issues like this); instead schedule
> a call to "fstrim" once a day or so, and/or on boot-up.
>
>> on ssd-like disks, by default there is only one copy of the metadata.
>
> Time and time again, the default of "single" metadata for SSD is a terrible
> idea. Most likely DUP metadata would have saved the FS in this case.
>
> --
> With respect,
> Roman
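Roman's suggestion of scheduling a batched trim instead of mounting with "discard" can look like this (the mount point is illustrative):

```shell
# One-off batched TRIM; -v prints how much was discarded.
fstrim -v /media/raid1

# Or enable the periodic timer shipped with util-linux, which runs
# fstrim on mounted filesystems that support discard:
systemctl enable --now fstrim.timer
```

Either way, the "discard" mount option can then be dropped from fstab.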
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On 2017-08-01 23:00, Christoph Anton Mitterer wrote:
> Hi.
>
> Stupid question:
> Would the write hole be closed already, if parity was checksummed?

No. The write hole problem is due to a combination of two things:
a) misalignment between parity and data (i.e. an unclean shutdown)
b) loss of a disk (i.e. disk failure)

Note: the write hole problem happens even if these two events are not consecutive.

After the disk failure, when you need to read data from the broken disk, you need the parity to compute the data. But if the parity is misaligned, wrong data is returned. The data checksums are sufficient to detect that wrong data is returned; a parity checksum is not needed for that. And in any case, neither can avoid the problem.

> Cheers,
> Chris.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On 2017-08-01 19:24, Liu Bo wrote:
> On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
>> Hi Liu,
>>
>> On 2017-08-01 18:14, Liu Bo wrote:
>>> This aims to fix the write hole issue on btrfs raid5/6 setups by adding a
>>> separate disk as a journal (aka raid5/6 log), so that after an unclean
>>> shutdown we can make sure data and parity are consistent on the raid
>>> array by replaying the journal.
>>
>> Would it be possible to have more information?
>> - What is logged? Data, parity, or data + parity?
>
> Patch 5 has more details (sorry for not making that clear in the
> cover letter).
>
> Both data and parity are logged, so that while replaying the journal
> everything is written to whichever disk it should be written to.

Is it correct to read this as: all data is written twice? Or are only the stripes involved in a RMW cycle logged (i.e. if a stripe is fully written, the log is bypassed)?

>> - In the past I thought it would be sufficient to log only the stripe
>> positions involved in a RMW cycle, and then start a scrub on those stripes in
>> case of an unclean shutdown: do you think that is feasible?
>
> An unclean shutdown causes inconsistency between data and parity, so
> scrub won't help, as it's not able to tell which one (data or parity)
> is valid.

Scrub compares data against its checksum, so it does know whether the data is correct. If no disk is lost, a scrub pass is sufficient (and needed) to rebuild the parity/data. The problem arises when, after "an unclean shutdown", a disk failure happens. But those are *two* distinct failures. Together they break the btrfs raid5 redundancy; but if you run a scrub between the two failures, the btrfs raid5 redundancy is still effective.

> With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
> With datacow, we don't do overwrite, but the following situation may
> happen. Say we have a raid5 setup with 3 disks and a stripe length of 64K:
>
> 1) write 64K --> now the raid layout is
>    [64K data + 64K random + 64K parity]
> 2) write another 64K --> now the raid layout after RMW is
>    [64K 1)'s data + 64K 2)'s data + 64K new parity]
>
> If an unclean shutdown occurs before 2) finishes, then the parity may be
> corrupted, and then 1)'s data may be recovered wrongly if the disk
> which holds 1)'s data is offline.
>
>> - Does this journal disk also host other btrfs logs?
>
> No, purely data/parity and some associated metadata.
>
> Thanks,
>
> -liubo
>
>>> The idea and the code are similar to the write-through mode of md
>>> raid5-cache, so ppl (partial parity log) is also feasible to implement.
>>> (If you've been familiar with md, you may find this patch set is
>>> boring to read...)
>>>
>>> Patches 1-3 are about adding a log disk, patches 5-8 are the main part
>>> of the implementation, and the rest are improvements and bugfixes,
>>> eg. readahead for recovery, checksum.
>>>
>>> Two btrfs-progs patches are required to play with this patch set: one
>>> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
>>> option '-L', the other is to teach 'btrfs-show-super' to show
>>> %journal_tail.
>>>
>>> This is currently based on 4.12-rc3.
>>>
>>> The patch set is tagged with RFC, and comments are always welcome,
>>> thanks.
>>>
>>> Known limitations:
>>> - Deleting a log device is not implemented yet.
>>>
>>> Liu Bo (14):
>>>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
>>>   Btrfs: raid56: do not allocate chunk on raid56 log
>>>   Btrfs: raid56: detect raid56 log on mount
>>>   Btrfs: raid56: add verbose debug
>>>   Btrfs: raid56: add stripe log for raid5/6
>>>   Btrfs: raid56: add reclaim support
>>>   Btrfs: raid56: load r5log
>>>   Btrfs: raid56: log recovery
>>>   Btrfs: raid56: add readahead for recovery
>>>   Btrfs: raid56: use the readahead helper to get page
>>>   Btrfs: raid56: add csum support
>>>   Btrfs: raid56: fix error handling while adding a log device
>>>   Btrfs: raid56: initialize raid5/6 log after adding it
>>>   Btrfs: raid56: maintain IO order on raid5/6 log
>>>
>>>  fs/btrfs/ctree.h                |   16 +-
>>>  fs/btrfs/disk-io.c              |   16 +
>>>  fs/btrfs/ioctl.c                |   48 +-
>>>  fs/btrfs/raid56.c               | 1429 ++-
>>>  fs/btrfs/raid56.h               |   82 +++
>>>  fs/btrfs/transaction.c          |    2 +
>>>  fs/btrfs/volumes.c              |   56 +-
>>>  fs/btrfs/volumes.h              |    7 +-
>>>  include/uapi/linux/btrfs.h      |    3 +
>>>  include/uapi/linux/btrfs_tree.h |    4 +
>>>
>>> 10 files changed, 1487 insertions(+), 176 deletions(-)
>>
>> --
>> gpg @keyserver.linux.it: Goffredo Baroncelli
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
Hi.

Stupid question: would the write hole be closed already if parity was checksummed?

Cheers,
Chris.
Re: Slow mounting raid1
2017-08-01 23:21 GMT+03:00 Leonidas Spyropoulos:
> On 01/08/17, E V wrote:
>> In general I think btrfs takes time proportional to the size of your
>> metadata to mount. Bigger and/or fragmented metadata leads to longer
>> mount times. My big backup fs with >300GB of metadata takes over
>> 20 minutes to mount, and that's with the space tree, which is
>> significantly faster than space cache v1.
>
> Hmm, my raid1 doesn't seem near full or to have significant metadata,
> so I don't think I'm in that case:
> # btrfs fi show /media/raid1/
> Label: 'raid1'  uuid: c9db91e6-0ba8-4ae6-b471-8fd4ff7ee72d
>         Total devices 2 FS bytes used 516.18GiB
>         devid 1 size 931.51GiB used 518.03GiB path /dev/sdd
>         devid 2 size 931.51GiB used 518.03GiB path /dev/sde
>
> # btrfs fi df /media/raid1/
> Data, RAID1: total=513.00GiB, used=512.21GiB
> System, RAID1: total=32.00MiB, used=112.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.97GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> I tried space_cache=v2 just to see if it would make any difference,
> but nothing changed:
> # cat /etc/fstab | grep raid1
> UUID=c9db91e6-0ba8-4ae6-b471-8fd4ff7ee72d /media/raid1 btrfs rw,noatime,compress=lzo,space_cache=v2 0 0
> # time umount /media/raid1 && time mount /media/raid1/
>
> real    0m0.807s
> user    0m0.237s
> sys     0m0.441s
>
> real    0m5.494s
> user    0m0.618s
> sys     0m0.116s
>
> I did a couple of rebalances on metadata and data and it improved a bit:
> # btrfs balance start -musage=100 /media/raid1/
> # btrfs balance start -dusage=10 /media/raid1/
> [.. incremental dusage 10 -> 95]
> # btrfs balance start -dusage=95 /media/raid1
>
> Down to 3.7 sec:
> # time umount /media/raid1 && time mount /media/raid1/
>
> real    0m0.807s
> user    0m0.237s
> sys     0m0.441s
>
> real    0m3.790s
> user    0m0.430s
> sys     0m0.031s
>
> I think maybe the next step is to disable compression if I want to mount
> it faster. Is it normal for BTRFS that performance degrades after
> some time?
> Regards,
>
> --
> Leonidas Spyropoulos
>
> A: Because it messes up the order in which people normally read text.
> Q: Why is it such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing on usenet and in e-mail?

AFAIK, for space_cache=v2 you need to do something like:

btrfs check --clear-space-cache v1 /dev/sdd
mount -o space_cache=v2 /dev/sdd

The first mount will be very slow, because it requires a rebuild of the space cache.

Thanks.

--
Have a nice day,
Timofey.
Re: Btrfs incremental send | receive fails with Error: File not found
Then the following problem is directly related to that:
https://unix.stackexchange.com/questions/377914/how-to-test-if-two-btrfs-snapshots-are-identical

Is that a bug or a feature?

2017-08-01 23:33 GMT+03:00 A L:
> On 8/1/2017 10:24 PM, Cerem Cem ASLAN wrote:
>> What does that mean? Can't we replicate the same snapshot with `btrfs send |
>> btrfs receive` multiple times, because it will have a "Received UUID" after
>> the first `btrfs receive`?
>
> You will need to make a new read-write snapshot of the received volume to
> fix it. Any snapshots created from the received subvolume can't be used for
> send-receive again, afaik.
>
> # btrfs subvolume snapshot subvolume.received subvolume
Re: Btrfs incremental send | receive fails with Error: File not found
On 8/1/2017 10:24 PM, Cerem Cem ASLAN wrote:
> What does that mean? Can't we replicate the same snapshot with `btrfs send |
> btrfs receive` multiple times, because it will have a "Received UUID" after
> the first `btrfs receive`?

You will need to make a new read-write snapshot of the received volume to fix it. Any snapshots created from the received subvolume can't be used for send-receive again, afaik.

# btrfs subvolume snapshot subvolume.received subvolume
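A possible workflow following this advice (all paths illustrative): check whether a subvolume carries a Received UUID before using it as a send source, and break the lineage with fresh snapshots if it does:

```shell
# If "Received UUID" shows a value rather than "-", the subvolume's
# send/receive lineage is tainted.
btrfs subvolume show /mnt/pool/subvolume | grep -i 'received uuid'

# Break the lineage: make a fresh read-write snapshot, then a
# read-only snapshot of that for use as the send source.
btrfs subvolume snapshot /mnt/pool/subvolume.received /mnt/pool/subvolume
btrfs subvolume snapshot -r /mnt/pool/subvolume /mnt/pool/subvolume.ro
btrfs send /mnt/pool/subvolume.ro | btrfs receive /mnt/backup/
```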
cant removed a corrupted dir and some files, btrfs-check crashes
Hi all...

I've been using btrfs for years without any major issues (OK, not true, but it was always my own fault). This time around I was testing the new BFQ in 4.12 (hint: don't use it for heavy I/O), and the laptop froze solid. So far so good; the thing is, once I rebooted I had a corrupted qcow2 image and a dir with some files that can't be removed or fixed. The qcow2 I fixed easily; I understand that using compression with sparse files might result in corruption, so no btrfs fault here. Now, the "unkillable" dir is another story. Every time I run btrfs check I get:

root@kerberos:~# btrfs check /dev/sda3
Checking filesystem on /dev/sda3
UUID: 619d4eb2-2c94-438b-9b9e-182ed969ad61
checking extents
checking free space cache
checking fs roots
root 258 inode 51958 errors 100, file extent discount
Found file extent holes:
	start: 8192, len: 499712
root 258 inode 4522616 errors 200, dir isize wrong
	unresolved ref dir 4522616 index 3 namelen 84 name ECRYPTFS_FNEK_ENCRYPTED.FWY9yE0bBJvSm-S7O71XRP3G1sxpTJjXp7LPoLQZgemwYbF7SG.ZpQvWlk-- filetype 1 errors 2, no dir index
root 258 inode 6036422 errors 1, no inode item
	unresolved ref dir 4522616 index 46227 namelen 104 name ECRYPTFS_FNEK_ENCRYPTED.FXY9yE0bBJvSm-S7O71XRP3G1sxpTJjXp7LPXtMCQ9BwG3JHBHoMOf9hI0EvP6p11X8OCd8Iew1bYMQ- filetype 1 errors 5, no dir item, no inode ref
root 258 inode 6036423 errors 1, no inode item
	unresolved ref dir 4522616 index 46229 namelen 84 name ECRYPTFS_FNEK_ENCRYPTED.FWY9yE0bBJvSm-S7O71XRP3G1sxpTJjXp7LPCSCfQYa2WG4o8T93CrHv0k-- filetype 1 errors 5, no dir item, no inode ref
root 258 inode 8792178 errors 1, no inode item
	unresolved ref dir 4522616 index 133165 namelen 84 name ECRYPTFS_FNEK_ENCRYPTED.FWY9yE0bBJvSm-S7O71XRP3G1sxpTJjXp7LPCSCfQYa2WG4o8T93CrHv0k-- filetype 1 errors 5, no dir item, no inode ref
root 258 inode 8792183 errors 1, no inode item
	unresolved ref dir 4522616 index 133167 namelen 104 name ECRYPTFS_FNEK_ENCRYPTED.FXY9yE0bBJvSm-S7O71XRP3G1sxpTJjXp7LPXtMCQ9BwG3JHBHoMOf9hI0EvP6p11X8OCd8Iew1bYMQ- filetype 1 errors 5, no dir item, no inode ref
ERROR: errors found in fs roots
found 109814329344 bytes used, error(s) found
total csum bytes: 106513584
total tree bytes: 672759808
total fs tree bytes: 455245824
total extent tree bytes: 82968576
btree space waste bytes: 139453318
file data blocks allocated: 581567266816
 referenced 103288545280

As you may have guessed, it's a btrfs + ecryptfs mount. I tried to delete the inodes, but the system can't "find" them. When I try btrfs check --repair, I get:

root@kerberos:~# btrfs check --repair /dev/sda3
enabling repair mode
Checking filesystem on /dev/sda3
UUID: 619d4eb2-2c94-438b-9b9e-182ed969ad61
checking extents
Unable to find block group for 0
extent-tree.c:287: find_search_start: Warning: assertion `1` failed, value 1
btrfs(+0x20c38)[0x936fcaac38]
btrfs(btrfs_reserve_extent+0x585)[0x936fcaee61]
btrfs(btrfs_alloc_free_block+0x63)[0x936fcaf229]
btrfs(__btrfs_cow_block+0xfe)[0x936fca30b9]
btrfs(btrfs_cow_block+0xc4)[0x936fca366f]
btrfs(+0x1d7ca)[0x936fca77ca]
btrfs(btrfs_commit_transaction+0xac)[0x936fca8f4a]
btrfs(+0x5557b)[0x936fcdf57b]
btrfs(cmd_check+0x1309)[0x936fce09ac]
btrfs(main+0x142)[0x936fca20d9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f956c3153f1]
btrfs(_start+0x2a)[0x936fca211a]
Unable to find block group for 0
extent-tree.c:287: find_search_start: Warning: assertion `1` failed, value 1
btrfs(+0x20c38)[0x936fcaac38]
btrfs(btrfs_reserve_extent+0x585)[0x936fcaee61]
btrfs(btrfs_alloc_free_block+0x63)[0x936fcaf229]
btrfs(__btrfs_cow_block+0xfe)[0x936fca30b9]
btrfs(btrfs_cow_block+0xc4)[0x936fca366f]
btrfs(+0x1d7ca)[0x936fca77ca]
btrfs(btrfs_commit_transaction+0xac)[0x936fca8f4a]
btrfs(+0x5557b)[0x936fcdf57b]
btrfs(cmd_check+0x1309)[0x936fce09ac]
btrfs(main+0x142)[0x936fca20d9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f956c3153f1]
btrfs(_start+0x2a)[0x936fca211a]
Unable to find block group for 0
extent-tree.c:287: find_search_start: Warning: assertion `1` failed, value 1
btrfs(+0x20c38)[0x936fcaac38]
btrfs(btrfs_reserve_extent+0x585)[0x936fcaee61]
btrfs(btrfs_alloc_free_block+0x63)[0x936fcaf229]
btrfs(__btrfs_cow_block+0xfe)[0x936fca30b9]
btrfs(btrfs_cow_block+0xc4)[0x936fca366f]
btrfs(+0x1d7ca)[0x936fca77ca]
btrfs(btrfs_commit_transaction+0xac)[0x936fca8f4a]
btrfs(+0x5557b)[0x936fcdf57b]
btrfs(cmd_check+0x1309)[0x936fce09ac]
btrfs(main+0x142)[0x936fca20d9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f956c3153f1]
btrfs(_start+0x2a)[0x936fca211a]
extent-tree.c:2694: btrfs_reserve_extent: BUG_ON `ret` triggered, value -28
btrfs(+0x20c38)[0x936fcaac38]
btrfs(+0x20ca8)[0x936fcaaca8]
btrfs(+0x20cbb)[0x936fcaacbb]
btrfs(btrfs_reserve_extent+0x751)[0x936fcaf02d]
btrfs(btrfs_alloc_free_block+0x63)[0x936fcaf229]
btrfs(__btrfs_cow_block+0xfe)[0x936fca30b9]
btrfs(btrfs_cow_block+0xc4)[0x936fca366f]
btrfs(+0x1d7ca)[0x936fca77ca]
btrfs(btrfs_commit_transaction+0xac)[0x936fca8f4a]
Re: Btrfs incremental send | receive fails with Error: File not found
What does that mean? Can't we replicate the same snapshot with `btrfs send | btrfs receive` multiple times, because it will have a "Received UUID" after the first `btrfs receive`?

2017-08-01 15:54 GMT+03:00 A L:
> OK. The problem was that the original subvolume had a "Received UUID". This
> caused all subsequent snapshots to have the same Received UUID, which messes
> up btrfs send | receive. Of course this means I must have used btrfs send |
> receive to create that subvolume and then turned it r/w at some point,
> though I cannot remember ever doing this.
>
> Perhaps a clear notice "WARNING: make sure that the source subvolume does
> not have a Received UUID" on the wiki would be helpful? Both on
> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup and on
> https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-property
>
> Regards,
> A
>
> On 7/28/2017 9:32 PM, Hermann Schwärzler wrote:
>> Hi,
>>
>> for me it looks like those snapshots are not read-only. But as far as I
>> know, for use with send they have to be.
>>
>> At least
>> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup#Initial_Bootstrapping
>> states "We will need to create a read-only snapshot..."
>>
>> I am using send/receive (with read-only snapshots) on a regular basis and
>> never had a problem like yours.
>>
>> What are the commands you use to create your snapshots?
>>
>> Greetings,
>> Hermann
>>
>> On 07/28/2017 07:26 PM, A L wrote:
>>> I often hit the following error when doing incremental btrfs send-receive:
>>> Btrfs incremental send | receive fails with Error: File not found
>>>
>>> Sometimes I can do two or three incremental snapshots, but then the same
>>> error (with a different file) happens again. It seems that the files were
>>> changed or replaced between snapshots, which is causing the problems for
>>> send-receive. I have tried to delete all snapshots and started over, but
>>> the problem comes back, so I think it must be a bug.
>>>
>>> The source volume is: /mnt/storagePool (with RAID1 profile)
>>> with subvolume: volume/userData
>>> Backup disk is: /media/usb-backup (external USB disk)
>> [...]
Re: Slow mounting raid1
On 01/08/17, E V wrote:
> In general I think btrfs takes time proportional to the size of your
> metadata to mount. Bigger and/or fragmented metadata leads to longer
> mount times. My big backup fs with >300GB of metadata takes over
> 20 minutes to mount, and that's with the space tree, which is
> significantly faster than space cache v1.

Hmm, my raid1 doesn't seem near full or to have significant metadata, so I don't think I'm in that case:

# btrfs fi show /media/raid1/
Label: 'raid1'  uuid: c9db91e6-0ba8-4ae6-b471-8fd4ff7ee72d
        Total devices 2 FS bytes used 516.18GiB
        devid 1 size 931.51GiB used 518.03GiB path /dev/sdd
        devid 2 size 931.51GiB used 518.03GiB path /dev/sde

# btrfs fi df /media/raid1/
Data, RAID1: total=513.00GiB, used=512.21GiB
System, RAID1: total=32.00MiB, used=112.00KiB
Metadata, RAID1: total=5.00GiB, used=3.97GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

I tried space_cache=v2 just to see if it would make any difference, but nothing changed:

# cat /etc/fstab | grep raid1
UUID=c9db91e6-0ba8-4ae6-b471-8fd4ff7ee72d /media/raid1 btrfs rw,noatime,compress=lzo,space_cache=v2 0 0
# time umount /media/raid1 && time mount /media/raid1/

real    0m0.807s
user    0m0.237s
sys     0m0.441s

real    0m5.494s
user    0m0.618s
sys     0m0.116s

I did a couple of rebalances on metadata and data and it improved a bit:

# btrfs balance start -musage=100 /media/raid1/
# btrfs balance start -dusage=10 /media/raid1/
[.. incremental dusage 10 -> 95]
# btrfs balance start -dusage=95 /media/raid1

Down to 3.7 sec:

# time umount /media/raid1 && time mount /media/raid1/

real    0m0.807s
user    0m0.237s
sys     0m0.441s

real    0m3.790s
user    0m0.430s
sys     0m0.031s

I think maybe the next step is to disable compression if I want it to mount faster. Is it normal for BTRFS that performance degrades after some time?

Regards,

--
Leonidas Spyropoulos

A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Re: Btrfs + compression = slow performance and high cpu usage
>> [ ... ] a "RAID5 with 128KiB writes and a 768KiB stripe
>> size". [ ... ] several back-to-back 128KiB writes [ ... ] get
>> merged by the 3ware firmware only if it has a persistent
>> cache, and maybe your 3ware does not have one,

> KOS: No, I don't have persistent cache. Only the 512 MB cache
> on board of the controller, which is battery-backed (BBU).

If it is a persistent cache, which can be battery-backed (as I wrote, but it seems that you don't have much time to read replies), then the size of the write, 128KiB or not, should not matter much: the write will be reported complete when it hits the persistent cache (whichever technology it uses), and then the HA firmware will spill write-cached data to the disks using the optimal operation width. Unless the 3ware firmware is really terrible (and depending on model and vintage it can be amazingly terrible), or the battery is no longer recharging, in which case the host adapter switches to write-through.

That you see very different rates between uncompressed and compressed writes, where the main difference is the limitation on the segment size, seems to indicate that compressed writes involve a lot of RMW, that is, sub-stripe updates. As I mentioned already, it would be interesting to retry 'dd' with different 'bs' values, without compression and with 'sync' (or 'direct', which only makes sense without compression).

> If I had additional SSD caching on the controller I would have
> mentioned it.

So far you had not mentioned the presence of a BBU cache either, which is equivalent, even though one of your previous messages (which I try to read carefully) contained these lines:

  Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
  Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

So perhaps someone else would have checked long ago the status of the BBU, and whether the "No Write Cache if Bad BBU" case has happened.
If the BBU is still working and the policy is still "WriteBack" then things are stranger still.

> I was also under impression, that in a situation where mostly
> extra large files will be stored on the array, the bigger
> strip size would indeed increase the speed, thus I went
> with the 256 Kb strip size.

That runs counter to this simple story: suppose a program is doing 64KiB IO:

* For *reads*, if there are 4 data drives and the strip size is 16KiB, the 64KiB will be read in parallel from 4 drives. If the strip size is 256KiB then the 64KiB will be read sequentially from just one disk, and 4 successive reads will be read sequentially from the same drive.

* For *writes* on a parity RAID like RAID5 things are much, much more extreme: with 16KiB strips the 64KiB will be written on a 5-wide RAID5 set in parallel to 5 drives, updating one full stripe without needing RMW. But with 256KiB strips it will partially update the stripe, because the stripe is 1024+256KiB, and it needs to do RMW, and four successive 64KiB writes will need to do that too, even if only one data drive is updated each time. Usually for RAID5 there is an optimization so that only the specific target drive and the parity drive(s) need RMW, but it is still very expensive.

This is the "storage for beginners" version; what happens in practice however depends a lot on the specific workload profile (typical read/write sizes, latencies and rates) and the caching and queueing algorithms in both Linux and the HA firmware.

> Would I be correct in assuming that the RAID strip size of 128
> Kb will be a better choice if one plans to use the BTRFS with
> compression?

That would need to be tested, because it "depends a lot on specific workload profile, caching and queueing algorithms", but my expectation is that the lower the better. Given that you have 4 drives giving a 3+1 RAID set, perhaps a 32KiB or 64KiB strip size, giving a data stripe size of 96KiB or 192KiB, would be better.
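The strip-size arithmetic above can be put into a tiny model (a sketch with made-up helper names, ignoring caching, queueing and parity-drive traffic; this is not code from btrfs or md):

```python
# Simplified model of how strip size affects a single I/O on RAID5.
# n_data: data drives per stripe (a 5-wide RAID5 set -> 4 data + 1 parity).

def raid5_io(io_kib, strip_kib, n_data):
    stripe_kib = strip_kib * n_data           # data capacity of one full stripe
    full_stripes, partial = divmod(io_kib, stripe_kib)
    # A full-stripe write needs no read-modify-write; a sub-stripe one does.
    rmw_needed = partial != 0
    data_drives_touched = min(io_kib // strip_kib, n_data) or 1
    return {"stripe_kib": stripe_kib, "rmw": rmw_needed,
            "data_drives_touched": data_drives_touched}

# 64KiB I/O with 16KiB strips on 4 data drives: one full stripe, no RMW.
print(raid5_io(64, 16, 4))
# Same I/O with 256KiB strips: sub-stripe update, so the write needs RMW
# and the read is served by a single data drive.
print(raid5_io(64, 256, 4))
```

The model only counts data drives; a real RMW additionally reads and rewrites the parity strip, which is what makes the large-strip case so expensive.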
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] btrfs: increase ctx->pos for delayed dir index
From: Josef Bacik

Our dir_context->pos is supposed to hold the next position we're supposed to look at. If we successfully insert a delayed dir index we could end up with a duplicate entry because we don't increase ctx->pos after doing the dir_emit.

Signed-off-by: Josef Bacik
---
 fs/btrfs/delayed-inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 8ae409b..19e4ad2 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1727,6 +1727,7 @@ int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
 		if (over)
 			return 1;
+		ctx->pos++;
 	}
 	return 0;
 }
-- 
2.7.4
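The off-by-one described in the commit message is easy to see in a toy model of a readdir loop (hypothetical structure, not the kernel code):

```python
# Toy readdir: 'pos' must always point at the NEXT entry to emit.
# If one code path emits an entry but forgets to advance pos, a later
# pass that resumes from 'pos' re-emits the same entry.

entries = {0: "a", 1: "b", 2: "c"}

def readdir(pos, bump_after_emit):
    emitted = []
    while pos in entries:
        emitted.append(entries[pos])
        if bump_after_emit:
            pos += 1
        else:
            break  # buggy path returns without advancing pos
    return emitted, pos

# Correct path: pos ends up past the last emitted entry.
print(readdir(0, True))    # (['a', 'b', 'c'], 3)
# Buggy path: the caller resumes at the same pos and 'a' shows up twice.
out1, pos = readdir(0, False)
out2, _ = readdir(pos, True)
print(out1 + out2)         # ['a', 'a', 'b', 'c']
```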
[PATCH 1/2][v3] btrfs: fix readdir deadlock with pagefault
From: Josef BacikReaddir does dir_emit while under the btree lock. dir_emit can trigger the page fault which means we can deadlock. Fix this by allocating a buffer on opening a directory and copying the readdir into this buffer and doing dir_emit from outside of the tree lock. Signed-off-by: Josef Bacik --- v2->v3: actually set the filp->private_data properly for ioctl trans. fs/btrfs/ctree.h | 5 +++ fs/btrfs/file.c | 9 - fs/btrfs/inode.c | 107 +-- fs/btrfs/ioctl.c | 22 4 files changed, 109 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 5ee9f10..33e942b 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1264,6 +1264,11 @@ struct btrfs_root { atomic64_t qgroup_meta_rsv; }; +struct btrfs_file_private { + struct btrfs_trans_handle *trans; + void *filldir_buf; +}; + static inline u32 btrfs_inode_sectorsize(const struct inode *inode) { return btrfs_sb(inode->i_sb)->sectorsize; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 0f102a1..1897c3b 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1973,8 +1973,15 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, int btrfs_release_file(struct inode *inode, struct file *filp) { - if (filp->private_data) + struct btrfs_file_private *private = filp->private_data; + + if (private && private->trans) btrfs_ioctl_trans_end(filp); + if (private && private->filldir_buf) + kfree(private->filldir_buf); + kfree(private); + filp->private_data = NULL; + /* * ordered_data_close is set by settattr when we are about to truncate * a file from a non-zero size to a zero size. This tries to diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 9a4413a..bbdbeea 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5877,25 +5877,73 @@ unsigned char btrfs_filetype_table[] = { DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK }; +/* + * All this infrastructure exists because dir_emit can fault, and we are holding + * the tree lock when doing readdir. 
For now just allocate a buffer and copy + * our information into that, and then dir_emit from the buffer. This is + * similar to what NFS does, only we don't keep the buffer around in pagecache + * because I'm afraid I'll fuck that up. Long term we need to make filldir do + * copy_to_user_inatomic so we don't have to worry about page faulting under the + * tree lock. + */ +static int btrfs_opendir(struct inode *inode, struct file *file) +{ + struct btrfs_file_private *private; + + private = kzalloc(sizeof(struct btrfs_file_private), GFP_KERNEL); + if (!private) + return -ENOMEM; + private->filldir_buf = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!private->filldir_buf) { + kfree(private); + return -ENOMEM; + } + file->private_data = private; + return 0; +} + +struct dir_entry { + u64 ino; + u64 offset; + unsigned type; + int name_len; +}; + +static int btrfs_filldir(void *addr, int entries, struct dir_context *ctx) +{ + while (entries--) { + struct dir_entry *entry = addr; + char *name = (char *)(entry + 1); + ctx->pos = entry->offset; + if (!dir_emit(ctx, name, entry->name_len, entry->ino, + entry->type)) + return 1; + addr += sizeof(struct dir_entry) + entry->name_len; + ctx->pos++; + } + return 0; +} + static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) { struct inode *inode = file_inode(file); struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_file_private *private = file->private_data; struct btrfs_dir_item *di; struct btrfs_key key; struct btrfs_key found_key; struct btrfs_path *path; + void *addr; struct list_head ins_list; struct list_head del_list; int ret; struct extent_buffer *leaf; int slot; - unsigned char d_type; - int over = 0; - char tmp_name[32]; char *name_ptr; int name_len; + int entries = 0; + int total_len = 0; bool put = false; struct btrfs_key location; @@ -5906,12 +5954,14 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) if (!path) 
return -ENOMEM; + addr = private->filldir_buf; path->reada = READA_FORWARD; INIT_LIST_HEAD(&ins_list); INIT_LIST_HEAD(&del_list); put = btrfs_readdir_get_delayed_items(inode, &ins_list, &del_list); +again: key.type = BTRFS_DIR_INDEX_KEY;
Re: Odd fallocate behavior on BTRFS.
On 2017-08-01 15:07, Holger Hoffstätte wrote:
> On 08/01/17 20:15, Holger Hoffstätte wrote:
>> On 08/01/17 19:34, Austin S. Hemmelgarn wrote:
>> [..]
>>> Apparently, if you call fallocate() on a file with an offset of 0 and
>>> a length longer than the length of the file itself, BTRFS will
>>> allocate that exact amount of space, instead of just filling in holes
>>> in the file and allocating space to extend it. If there isn't enough
>>> space on the filesystem for this, then it will fail, even though it
>>> would succeed on ext4, XFS, and F2FS.
>> [..]
>>> I'm curious to hear anybody's thoughts on this, namely: 1. Is this
>>> behavior that should be considered implementation defined? 2. If not,
>>> is my assessment that BTRFS is behaving incorrectly in this case
>>> accurate?
>>
>> IMHO no and yes, respectively. Both fallocate(2) and posix_fallocate(3)
>> make it very clear that the expected default behaviour is to extend.
>> I don't think this can be interpreted in any other way than incorrect
>> behaviour on behalf of btrfs.
>>
>> Your script reproduces for me, so that's a start.
>
> Your reproducer should never ENOSPC because it requires exactly 0 new
> bytes to be allocated, yet it also fails with --keep-size.

Unless I'm doing the math wrong, it should require exactly 2 new bytes. 65536 (the block size for dd) times 32768 (the block count for dd) is 2147483648 (2^31), while the fallocate call requests a total size of 2147483650 bytes. It may not need to allocate a new block, but it should definitely be extending the file.

> From a quick look it seems that btrfs_fallocate() unconditionally calls
> btrfs_alloc_data_chunk_ondemand(inode, alloc_end - alloc_start) to
> lazily allocate the necessary extent(s), which goes ENOSPC because that
> size is again the full size of the requested range, not the difference
> between the existing file size and the new range length. But I might be
> misreading things..

As far as I can tell, that is correct.
However, we can't just extend the range, because the existing file might have sparse regions, and those need to have allocations forced too (and based on the code, this will also cause issues any time the fallocate range includes already allocated extents, so I don't think it can be special cased either).
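The extend semantics being discussed are easy to check from userspace; a small sketch using Python's os.posix_fallocate on a throwaway file (this is not the original reproducer script, and it exercises whatever filesystem the temp directory lives on):

```python
import os
import tempfile

# posix_fallocate(fd, offset, len) must ensure [offset, offset+len) is
# allocated, extending the file if offset+len is past EOF -- for bytes
# that are already written it should only need to allocate the difference.

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)       # existing 4096-byte file
    f.flush()
    # Requested range covers the whole file plus 2 extra bytes.
    os.posix_fallocate(f.fileno(), 0, 4098)
    size = os.fstat(f.fileno()).st_size

print(size)                    # 4098: the file was extended by just 2 bytes
os.unlink(f.name)
```

This mirrors the thread's arithmetic: a 2^31-byte file plus a 2147483650-byte fallocate range should cost 2 new bytes, not another 2 GiB.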
Re: Odd fallocate behavior on BTRFS.
On 08/01/17 20:15, Holger Hoffstätte wrote:
> On 08/01/17 19:34, Austin S. Hemmelgarn wrote:
> [..]
>> Apparently, if you call fallocate() on a file with an offset of 0 and
>> a length longer than the length of the file itself, BTRFS will
>> allocate that exact amount of space, instead of just filling in holes
>> in the file and allocating space to extend it. If there isn't enough
>> space on the filesystem for this, then it will fail, even though it
>> would succeed on ext4, XFS, and F2FS.
> [..]
>> I'm curious to hear anybody's thoughts on this, namely: 1. Is this
>> behavior that should be considered implementation defined? 2. If not,
>> is my assessment that BTRFS is behaving incorrectly in this case
>> accurate?
>
> IMHO no and yes, respectively. Both fallocate(2) and posix_fallocate(3)
> make it very clear that the expected default behaviour is to extend.
> I don't think this can be interpreted in any other way than incorrect
> behaviour on behalf of btrfs.
>
> Your script reproduces for me, so that's a start.

Your reproducer should never ENOSPC because it requires exactly 0 new bytes to be allocated, yet it also fails with --keep-size.

From a quick look it seems that btrfs_fallocate() unconditionally calls btrfs_alloc_data_chunk_ondemand(inode, alloc_end - alloc_start) to lazily allocate the necessary extent(s), which goes ENOSPC because that size is again the full size of the requested range, not the difference between the existing file size and the new range length. But I might be misreading things..

-h
Re: Raid0 rescue
On Tue, Aug 1, 2017 at 12:36 PM, Alan Brand wrote:
> I successfully repaired the superblock, copied it from one of the backups.
> My biggest problem now is that the UUID for the disk has changed due
> to the reformatting and no longer matches what is in the metadata.
> I need to make linux recognize the partition as btrfs and have the correct
> UUID.
> Any suggestions?

Huh, insofar as I'm aware, Btrfs does not track a "disk" UUID or partition UUID. A better qualified set of steps for fixing this would be:

a.) restore partitioning, if any
b.) wipefs the NTFS signature to invalidate the NTFS file system
c.) use super-recover to replace correct supers on both drives
d.) mount the file system
e.) do a full scrub

The last step is optional but best practice. It'll actively do fixups, and you'll get an error message with the path to files that are not recoverable. Alternatively a metadata-only balance will do fixups, and it'll be much faster. But you won't get info right away about what files are damaged.

-- 
Chris Murphy
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
> Hi Liu,
>
> On 2017-08-01 18:14, Liu Bo wrote:
> > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > separate disk as a journal (aka raid5/6 log), so that after unclean
> > shutdown we can make sure data and parity are consistent on the raid
> > array by replaying the journal.
>
> it would be possible to have more information ?
> - what is logged ? data, parity or data + parity ?

Patch 5 has more details (sorry for not making that clear in the cover letter). Both data and parity are logged, so that while replaying the journal everything is written to whichever disk it should be written to.

> - in the past I thought that it would be sufficient to log only the stripe
>   position involved by a RMW cycle, and then start a scrub on these stripes
>   in case of an unclean shutdown: do you think that it is feasible ?

An unclean shutdown causes inconsistency between data and parity, so scrub won't help, as it's not able to tell which one (data or parity) is valid. With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe. With datacow, we don't do overwrite, but the following situation may happen. Say we have a raid5 setup with 3 disks and a stripe length of 64K, so:

1) write 64K --> now the raid layout is [64K data + 64K random + 64K parity]
2) write another 64K --> now the raid layout after RMW is [64K 1)'s data + 64K 2)'s data + 64K new parity]

If an unclean shutdown occurs before 2) finishes, then the parity may be corrupted, and 1)'s data may then be recovered wrongly if the disk which holds 1)'s data is offline.

> - does this journal disk also host other btrfs log ?

No, purely data/parity and some associated metadata.

Thanks,

-liubo

> > The idea and the code are similar to the write-through mode of md
> > raid5-cache, so ppl(partial parity log) is also feasible to implement.
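The write-hole scenario Liu Bo describes can be sketched in a few lines (a toy 3-device RAID5 model with invented variables, not the btrfs code):

```python
# Toy 3-device RAID5: d0 and d1 are data strips, p = d0 XOR d1.
# Model an unclean shutdown where the new data strip reaches disk but
# the matching parity update is lost: the classic "write hole".

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0 = b"\xaa" * 4
d1 = b"\x55" * 4
p = xor(d0, d1)                  # parity consistent with d0, d1

d1 = b"\xff" * 4                 # data write of step 2) completes...
# ...crash before parity is rewritten: p still reflects the OLD d1.

# Later, the disk holding d0 goes offline; reconstruct d0 from d1 and
# the stale parity:
recovered_d0 = xor(d1, p)
print(recovered_d0 == b"\xaa" * 4)   # False -- d0 is recovered wrongly
```

With a journal, data and parity are replayed together after the crash, so `p` always matches the data strips before any reconstruction is attempted.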
Re: Raid0 rescue
On Thu, Jul 27, 2017 at 8:49 AM, Alan Brand wrote:
> I know I am screwed but hope someone here can point at a possible solution.
>
> I had a pair of btrfs drives in a raid0 configuration. One of the
> drives was pulled by mistake, put in a windows box, and a quick NTFS
> format was done. Then much screaming occurred.
>
> I know the data is still there. Is there anyway to rebuild the raid
> bringing in the bad disk? I know some info is still good, for example
> metadata0 is corrupt but 1 and 2 are good.
> The trees look bad which is probably the killer.

Well, the first step is to check and fix the super blocks. And then the normal code should just discover the bad stuff, get good copies from the good drive, and copy them to the corrupt one, passively, and eventually fix the file system itself. There's probably only a few files corrupted irrecoverably.

It's probably worth testing for this explicitly. It's not a wild scenario, and it's something Btrfs should be able to recover from gracefully. The gotcha part of a totally automatic recovery is the superblocks, because there's no *one true right way* for the kernel to just assume the remaining Btrfs supers are more valid than the NTFS supers.

So then the question is, which tool should fix this up? I'd say both 'btrfs rescue super-recover' and 'btrfs check' should do this. The difference being super-recover would fix only the supers, with kernel code doing passive fixups as problems are encountered once the fs is mounted. And 'check --repair' would fix supers and additionally fix missing metadata on the corrupt drive, using user space code with an unmounted system. Both should work, or at least both should be fail safe.

-- 
Chris Murphy
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, Aug 01, 2017 at 10:56:39AM -0600, Liu Bo wrote: > On Tue, Aug 01, 2017 at 05:28:57PM +, Hugo Mills wrote: > >Hi, > > > >Great to see something addressing the write hole at last. > > > > On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote: > > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > > > separate disk as a journal (aka raid5/6 log), so that after unclean > > > shutdown we can make sure data and parity are consistent on the raid > > > array by replaying the journal. > > > >What's the behaviour of the FS if the log device dies during use? > > > > Error handling on IOs is still under construction (belongs to known > limitations). > > If the log device dies suddenly, I think we could skip the writeback > to backend raid arrays and follow the rule in btrfs, filp FS to > readonly as it may expose data loss. What do you think? I think the key thing for me is that the overall behaviour of the redundancy in the FS is not compromised by the logging solution. That is, the same guarantees still hold: For RAID-5, you can lose up to one device of the FS (*including* any log devices), and the FS will continue to operate normally, but degraded. For RAID-6, you can lose up to two devices without losing any capabilities of the FS. Dropping to read-only if the (single) log device fails would break those guarantees. I quite like the idea of embedding the log chunks into the allocated structure of the FS -- although as pointed out, this is probably going to need a new chunk type, and (to retain the guarantees of the RAID-6 behaviour above) the ability to do 3-way RAID-1 on those chunks. You'd also have to be able to balance the log structures while in flight. It sounds like a lot more work for you, though. Hmm... if 3-way RAID-1 (3c) is available, then you could also have RAID-1*3 on metadata, RAID-6 on data, and have 2-device redundancy throughout. That's also a very attractive configuration in many respects. 
(Analogous to RAID-1 metadata and RAID-5 data).

Hugo.

-- 
Hugo Mills             | That's not rain, that's a lake with slots in it.
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4
Re: Odd fallocate behavior on BTRFS.
On 08/01/17 19:34, Austin S. Hemmelgarn wrote:
[..]
> Apparently, if you call fallocate() on a file with an offset of 0 and
> a length longer than the length of the file itself, BTRFS will
> allocate that exact amount of space, instead of just filling in holes
> in the file and allocating space to extend it. If there isn't enough
> space on the filesystem for this, then it will fail, even though it
> would succeed on ext4, XFS, and F2FS.
[..]
> I'm curious to hear anybody's thoughts on this, namely: 1. Is this
> behavior that should be considered implementation defined? 2. If not,
> is my assessment that BTRFS is behaving incorrectly in this case
> accurate?

IMHO no and yes, respectively. Both fallocate(2) and posix_fallocate(3) make it very clear that the expected default behaviour is to extend. I don't think this can be interpreted in any other way than incorrect behaviour on behalf of btrfs.

Your script reproduces for me, so that's a start.

-h
Re: Btrfs + compression = slow performance and high cpu usage
----- Original Message -----
From: "Peter Grandi"
To: "Linux fs Btrfs"
Sent: Tuesday, 1 August, 2017 3:14:07 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage

> Peter, I don't think the filefrag is showing the correct
> fragmentation status of the file when the compression is used.

As I wrote, "their size is just limited by the compression code" which results in "128KiB writes". On a "fresh empty Btrfs volume" the compressed extents limited to 128KiB also happen to be pretty physically contiguous, but on a more fragmented free space list they can be more scattered.

KOS: Ok, thanks for pointing it out. I have compared the filefrag -v on another btrfs that is not fragmented and see the difference with what is happening on the sluggish one.

5824: 186368.. 186399: 2430093383..2430093414: 32: 2430093414: encoded
5825: 186400.. 186431: 2430093384..2430093415: 32: 2430093415: encoded
5826: 186432.. 186463: 2430093385..2430093416: 32: 2430093416: encoded
5827: 186464.. 186495: 2430093386..2430093417: 32: 2430093417: encoded
5828: 186496.. 186527: 2430093387..2430093418: 32: 2430093418: encoded
5829: 186528.. 186559: 2430093388..2430093419: 32: 2430093419: encoded
5830: 186560.. 186591: 2430093389..2430093420: 32: 2430093420: encoded

As I already wrote, the main issue here seems to be that we are talking about a "RAID5 with 128KiB writes and a 768KiB stripe size". On MD RAID5 the slowdown because of RMW seems only to be around 30-40%, but it looks like several back-to-back 128KiB writes get merged by the Linux IO subsystem (not sure whether that's thoroughly legal), and perhaps they get merged by the 3ware firmware only if it has a persistent cache, and maybe your 3ware does not have one, but you have kept your counsel as to that.

KOS: No I don't have persistent cache. Only the 512 Mb cache on board of the controller, that is BBU. If I had additional SSD caching on the controller I would have mentioned it.
I was also under the impression that in a situation where mostly extra large files will be stored on the array, a bigger strip size would indeed increase the speed, thus I went with the 256 Kb strip size.

Would I be correct in assuming that a RAID strip size of 128 Kb will be a better choice if one plans to use BTRFS with compression?

thanks,
kos
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, Aug 01, 2017 at 01:39:59PM -0400, Austin S. Hemmelgarn wrote: > On 2017-08-01 13:25, Roman Mamedov wrote: > > On Tue, 1 Aug 2017 10:14:23 -0600 > > Liu Bowrote: > > > > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > > > separate disk as a journal (aka raid5/6 log), so that after unclean > > > shutdown we can make sure data and parity are consistent on the raid > > > array by replaying the journal. > > > > Could it be possible to designate areas on the in-array devices to be used > > as > > journal? > > > > While md doesn't have much spare room in its metadata for extraneous things > > like this, Btrfs could use almost as much as it wants to, adding to size of > > the > > FS metadata areas. Reliability-wise, the log could be stored as RAID1 > > chunks. > > > > It doesn't seem convenient to need having an additional storage device > > around > > just for the log, and also needing to maintain its fault tolerance yourself > > (so > > the log device would better be on a mirror, such as mdadm RAID1? more > > expense > > and maintenance complexity). > > > I agree, MD pretty much needs a separate device simply because they can't > allocate arbitrary space on the other array members. BTRFS can do that > though, and I would actually think that that would be _easier_ to implement > than having a separate device. > Yes and no, using chunks may need a new ioctl and diving into chunk allocation/(auto)deletion maze. > That said, I do think that it would need to be a separate chunk type, > because things could get really complicated if the metadata is itself using > a parity raid profile. Exactly, esp. when balance comes into the picture. Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What are the typical usecase of "btrfs check --init-extent-tree"?
On Thu, Jul 27, 2017 at 9:33 AM, Ivan Sizov wrote:
> I've just noticed a huge number of errors on one of the RAID's disks.
> "btrfs dev stats" gives:
>
> [/dev/sdc1].write_io_errs    0
> [/dev/sdc1].read_io_errs     305
> [/dev/sdc1].flush_io_errs    0
> [/dev/sdc1].corruption_errs  429
> [/dev/sdc1].generation_errs  0
>
> [/dev/sda1].write_io_errs    58331
> [/dev/sda1].read_io_errs     57438
> [/dev/sda1].flush_io_errs    37
> [/dev/sda1].corruption_errs  10110
> [/dev/sda1].generation_errs  0

You'll need to translate the sda device to an ata device, and then do a search for kernel messages. I suspect a persistent bad sector on this drive, and write failures are always disqualifying, but Btrfs won't eject this device.

Read errors are not a big problem. Read errors along with corruptions aren't necessarily a big problem. Write errors are a big problem.

-- 
Chris Murphy
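For the "translate the sda device to an ata device" step, the mapping can be read from the sysfs device path; a sketch that extracts the ataN component (the sample path below is illustrative, in practice you would resolve os.path.realpath("/sys/block/sda") and then grep dmesg for that port):

```python
import re

def ata_port(sysfs_path):
    """Extract the ataN component from a resolved /sys/block/sdX path."""
    m = re.search(r"/(ata\d+)/", sysfs_path)
    return m.group(1) if m else None

# Illustrative path of the kind os.path.realpath("/sys/block/sda") returns:
sample = ("/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/"
          "target0:0:0/0:0:0:0/block/sda")
print(ata_port(sample))   # ata1
# Then search the kernel log for lines mentioning "ata1" to find the
# media/write error reports behind the dev-stats counters.
```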
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, Aug 01, 2017 at 10:25:47PM +0500, Roman Mamedov wrote: > On Tue, 1 Aug 2017 10:14:23 -0600 > Liu Bowrote: > > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > > separate disk as a journal (aka raid5/6 log), so that after unclean > > shutdown we can make sure data and parity are consistent on the raid > > array by replaying the journal. > > Could it be possible to designate areas on the in-array devices to be used as > journal? > > While md doesn't have much spare room in its metadata for extraneous things > like this, Btrfs could use almost as much as it wants to, adding to size of > the > FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks. > Yes, it makes sense, we could definitely do that, that was actually the original idea. I started with adding a new device for log as it looks easier to me, but I could try that now. > It doesn't seem convenient to need having an additional storage device around > just for the log, and also needing to maintain its fault tolerance yourself > (so > the log device would better be on a mirror, such as mdadm RAID1? more expense > and maintenance complexity). > That's true. Thanks for the suggestions. Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, Aug 01, 2017 at 05:28:57PM +, Hugo Mills wrote: >Hi, > >Great to see something addressing the write hole at last. > > On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote: > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > > separate disk as a journal (aka raid5/6 log), so that after unclean > > shutdown we can make sure data and parity are consistent on the raid > > array by replaying the journal. > >What's the behaviour of the FS if the log device dies during use? > Error handling on IOs is still under construction (belongs to known limitations). If the log device dies suddenly, I think we could skip the writeback to backend raid arrays and follow the rule in btrfs, filp FS to readonly as it may expose data loss. What do you think? Thanks, -liubo >Hugo. > > > The idea and the code are similar to the write-through mode of md > > raid5-cache, so ppl(partial parity log) is also feasible to implement. > > (If you've been familiar with md, you may find this patch set is > > boring to read...) > > > > Patch 1-3 are about adding a log disk, patch 5-8 are the main part of > > the implementation, the rest patches are improvements and bugfixes, > > eg. readahead for recovery, checksum. > > > > Two btrfs-progs patches are required to play with this patch set, one > > is to enhance 'btrfs device add' to add a disk as raid5/6 log with the > > option '-L', the other is to teach 'btrfs-show-super' to show > > %journal_tail. > > > > This is currently based on 4.12-rc3. > > > > The patch set is tagged with RFC, and comments are always welcome, > > thanks. > > > > Known limitations: > > - Deleting a log device is not implemented yet. 
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
Hi Liu, On 2017-08-01 18:14, Liu Bo wrote: > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > separate disk as a journal (aka raid5/6 log), so that after unclean > shutdown we can make sure data and parity are consistent on the raid > array by replaying the journal. > Would it be possible to have more information? - what is logged? data, parity, or data + parity? - in the past I thought that it would be sufficient to log only the stripe positions involved in a RMW cycle, and then start a scrub on these stripes in case of an unclean shutdown: do you think that is feasible? - does this journal disk also host other btrfs logs? > The idea and the code are similar to the write-through mode of md > raid5-cache, so ppl(partial parity log) is also feasible to implement. > (If you've been familiar with md, you may find this patch set is > boring to read...) > > Patch 1-3 are about adding a log disk, patch 5-8 are the main part of > the implementation, the rest patches are improvements and bugfixes, > eg. readahead for recovery, checksum. > > Two btrfs-progs patches are required to play with this patch set, one > is to enhance 'btrfs device add' to add a disk as raid5/6 log with the > option '-L', the other is to teach 'btrfs-show-super' to show > %journal_tail. > > This is currently based on 4.12-rc3. > > The patch set is tagged with RFC, and comments are always welcome, > thanks. > > Known limitations: > - Deleting a log device is not implemented yet.
> > > Liu Bo (14): > Btrfs: raid56: add raid56 log via add_dev v2 ioctl > Btrfs: raid56: do not allocate chunk on raid56 log > Btrfs: raid56: detect raid56 log on mount > Btrfs: raid56: add verbose debug > Btrfs: raid56: add stripe log for raid5/6 > Btrfs: raid56: add reclaim support > Btrfs: raid56: load r5log > Btrfs: raid56: log recovery > Btrfs: raid56: add readahead for recovery > Btrfs: raid56: use the readahead helper to get page > Btrfs: raid56: add csum support > Btrfs: raid56: fix error handling while adding a log device > Btrfs: raid56: initialize raid5/6 log after adding it > Btrfs: raid56: maintain IO order on raid5/6 log > > fs/btrfs/ctree.h| 16 +- > fs/btrfs/disk-io.c | 16 + > fs/btrfs/ioctl.c| 48 +- > fs/btrfs/raid56.c | 1429 > ++- > fs/btrfs/raid56.h | 82 +++ > fs/btrfs/transaction.c |2 + > fs/btrfs/volumes.c | 56 +- > fs/btrfs/volumes.h |7 +- > include/uapi/linux/btrfs.h |3 + > include/uapi/linux/btrfs_tree.h |4 + > 10 files changed, 1487 insertions(+), 176 deletions(-) > -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On 2017-08-01 13:25, Roman Mamedov wrote: On Tue, 1 Aug 2017 10:14:23 -0600 Liu Bo wrote: This aims to fix write hole issue on btrfs raid5/6 setup by adding a separate disk as a journal (aka raid5/6 log), so that after unclean shutdown we can make sure data and parity are consistent on the raid array by replaying the journal. Could it be possible to designate areas on the in-array devices to be used as journal? While md doesn't have much spare room in its metadata for extraneous things like this, Btrfs could use almost as much as it wants to, adding to size of the FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks. It doesn't seem convenient to need having an additional storage device around just for the log, and also needing to maintain its fault tolerance yourself (so the log device would better be on a mirror, such as mdadm RAID1? more expense and maintenance complexity). I agree, MD pretty much needs a separate device simply because they can't allocate arbitrary space on the other array members. BTRFS can do that though, and I would actually think that that would be _easier_ to implement than having a separate device. That said, I do think that it would need to be a separate chunk type, because things could get really complicated if the metadata is itself using a parity raid profile.
Odd fallocate behavior on BTRFS.
A recent thread on the BTRFS mailing list [1] brought up some odd behavior in BTRFS that I've long suspected but not had prior reason to test. I've put the fsdevel mailing list on CC since I'm curious to hear what people there think about this. Apparently, if you call fallocate() on a file with an offset of 0 and a length longer than the length of the file itself, BTRFS will allocate that exact amount of space, instead of just filling in holes in the file and allocating space to extend it. If there isn't enough space on the filesystem for this, then it will fail, even though it would succeed on ext4, XFS, and F2FS. The following script demonstrates this: #!/bin/bash touch ./test-fs truncate --size=4G ./test-fs mkfs.btrfs ./test-fs mkdir ./test mount -t auto ./test-fs ./test dd if=/dev/zero of=./test/test bs=65536 count=32768 fallocate -l 2147483650 ./test/test && echo "Success!" umount ./test rmdir ./test rm -f ./test-fs This will spit out a -ENOSPC error from the fallocate call, but if you change the mkfs call to ext4, XFS, or F2FS, it will instead succeed without error. If the fallocate call is changed to `fallocate -o 2147483648 -l 2 ./test/test`, it will succeed on all filesystems. I have not yet done any testing to determine if this also applies for offsets other than 0, but I suspect it does (it would be kind of odd if it didn't). My thought on this is that the behavior that BTRFS exhibits is incorrect in this case, at a minimum because it does not follow the apparent de-facto standard, and because it keeps some things from working (the OP in the thread that resulted in me finding this was having issues trying to extend a SnapRAID parity file that was already larger than half the size of the BTRFS volume it was stored on). I'm curious to hear anybody's thoughts on this, namely: 1. Is this behavior that should be considered implementation defined? 2. If not, is my assessment that BTRFS is behaving incorrectly in this case accurate? 
[1] https://marc.info/?l=linux-btrfs=150158963921123=2
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
Hi, Great to see something addressing the write hole at last. On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote: > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > separate disk as a journal (aka raid5/6 log), so that after unclean > shutdown we can make sure data and parity are consistent on the raid > array by replaying the journal. What's the behaviour of the FS if the log device dies during use? Hugo. > The idea and the code are similar to the write-through mode of md > raid5-cache, so ppl(partial parity log) is also feasible to implement. > (If you've been familiar with md, you may find this patch set is > boring to read...) > > Patch 1-3 are about adding a log disk, patch 5-8 are the main part of > the implementation, the rest patches are improvements and bugfixes, > eg. readahead for recovery, checksum. > > Two btrfs-progs patches are required to play with this patch set, one > is to enhance 'btrfs device add' to add a disk as raid5/6 log with the > option '-L', the other is to teach 'btrfs-show-super' to show > %journal_tail. > > This is currently based on 4.12-rc3. > > The patch set is tagged with RFC, and comments are always welcome, > thanks. > > Known limitations: > - Deleting a log device is not implemented yet. 
> > > Liu Bo (14): > Btrfs: raid56: add raid56 log via add_dev v2 ioctl > Btrfs: raid56: do not allocate chunk on raid56 log > Btrfs: raid56: detect raid56 log on mount > Btrfs: raid56: add verbose debug > Btrfs: raid56: add stripe log for raid5/6 > Btrfs: raid56: add reclaim support > Btrfs: raid56: load r5log > Btrfs: raid56: log recovery > Btrfs: raid56: add readahead for recovery > Btrfs: raid56: use the readahead helper to get page > Btrfs: raid56: add csum support > Btrfs: raid56: fix error handling while adding a log device > Btrfs: raid56: initialize raid5/6 log after adding it > Btrfs: raid56: maintain IO order on raid5/6 log > > fs/btrfs/ctree.h| 16 +- > fs/btrfs/disk-io.c | 16 + > fs/btrfs/ioctl.c| 48 +- > fs/btrfs/raid56.c | 1429 > ++- > fs/btrfs/raid56.h | 82 +++ > fs/btrfs/transaction.c |2 + > fs/btrfs/volumes.c | 56 +- > fs/btrfs/volumes.h |7 +- > include/uapi/linux/btrfs.h |3 + > include/uapi/linux/btrfs_tree.h |4 + > 10 files changed, 1487 insertions(+), 176 deletions(-) > -- Hugo Mills | Some days, it's just not worth gnawing through the hugo@... carfax.org.uk | straps http://carfax.org.uk/ | PGP: E2AB1DE4 |
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, 1 Aug 2017 10:14:23 -0600 Liu Bo wrote: > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > separate disk as a journal (aka raid5/6 log), so that after unclean > shutdown we can make sure data and parity are consistent on the raid > array by replaying the journal. Could it be possible to designate areas on the in-array devices to be used as journal? While md doesn't have much spare room in its metadata for extraneous things like this, Btrfs could use almost as much as it wants to, adding to size of the FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks. It doesn't seem convenient to need having an additional storage device around just for the log, and also needing to maintain its fault tolerance yourself (so the log device would better be on a mirror, such as mdadm RAID1? more expense and maintenance complexity). -- With respect, Roman
[PATCH 12/14] Btrfs: raid56: fix error handling while adding a log device
Currently there is a memory leak if we hit an error while adding a raid5/6 log. Moreover, it didn't abort the transaction as other error paths do. This fixes the broken error handling by splitting log initialization into two steps: step #1 allocates memory and checks that the device has a proper size, and step #2 assigns the pointer in %fs_info. By running step #1 ahead of starting the transaction, we can now gracefully bail out on errors. Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 48 +--- fs/btrfs/raid56.h | 5 + fs/btrfs/volumes.c | 36 ++-- 3 files changed, 68 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 8bc7ba4..0bfc97a 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -3711,30 +3711,64 @@ void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio) async_missing_raid56(rbio); } -int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device) +struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info *fs_info, struct btrfs_device *device, struct block_device *bdev) { - struct btrfs_r5l_log *log; - - log = kzalloc(sizeof(*log), GFP_NOFS); + int num_devices = fs_info->fs_devices->num_devices; + u64 dev_total_bytes; + struct btrfs_r5l_log *log = kzalloc(sizeof(struct btrfs_r5l_log), GFP_NOFS); if (!log) - return -ENOMEM; + return ERR_PTR(-ENOMEM); + + ASSERT(device); + ASSERT(bdev); + dev_total_bytes = i_size_read(bdev->bd_inode); /* see find_free_dev_extent for 1M start offset */ log->data_offset = 1024ull * 1024; - log->device_size = btrfs_device_get_total_bytes(device) - log->data_offset; + log->device_size = dev_total_bytes - log->data_offset; log->device_size = round_down(log->device_size, PAGE_SIZE); + + /* +* when device has been included in fs_devices, do not take +* into account this device when checking log size.
+*/ + if (device->in_fs_metadata) + num_devices--; + + if (log->device_size < BTRFS_STRIPE_LEN * num_devices * 2) { + btrfs_info(fs_info, "r5log log device size (%llu < %llu) is too small", log->device_size, BTRFS_STRIPE_LEN * num_devices * 2); + kfree(log); + return ERR_PTR(-EINVAL); + } + log->dev = device; log->fs_info = fs_info; ASSERT(sizeof(device->uuid) == BTRFS_UUID_SIZE); log->uuid_csum = btrfs_crc32c(~0, device->uuid, sizeof(device->uuid)); mutex_init(&log->io_mutex); + return log; +} + +void btrfs_r5l_init_log_post(struct btrfs_fs_info *fs_info, struct btrfs_r5l_log *log) +{ cmpxchg(&fs_info->r5log, NULL, log); ASSERT(fs_info->r5log == log); #ifdef BTRFS_DEBUG_R5LOG - trace_printk("r5log: set a r5log in fs_info, alloc_range 0x%llx 0x%llx", + trace_printk("r5log: set a r5log in fs_info, alloc_range 0x%llx 0x%llx\n", log->data_offset, log->data_offset + log->device_size); #endif +} + +int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device) +{ + struct btrfs_r5l_log *log; + + log = btrfs_r5l_init_log_prepare(fs_info, device, device->bdev); + if (IS_ERR(log)) + return PTR_ERR(log); + + btrfs_r5l_init_log_post(fs_info, log); return 0; } diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index 569cec8..f6d6f36 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -134,6 +134,11 @@ void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio); int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info); void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info); +struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info *fs_info, + struct btrfs_device *device, + struct block_device *bdev); +void btrfs_r5l_init_log_post(struct btrfs_fs_info *fs_info, +struct btrfs_r5l_log *log); int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device); int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp); #endif diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index ac64d93..851c001 100644 ---
a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2327,6 +2327,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path int seeding_dev = 0; int ret = 0; bool is_r5log = (flags & BTRFS_DEVICE_RAID56_LOG); + struct btrfs_r5l_log *r5log = NULL; if (is_r5log) ASSERT(!fs_info->fs_devices->seeding); @@ -2367,6 +2368,15 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path goto error; } +
[PATCH 10/14] Btrfs: raid56: use the readahead helper to get page
This updates recovery code to use the readahead helper. Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 24f7cbb..8f47e56 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1608,7 +1608,9 @@ static int btrfs_r5l_recover_load_meta(struct btrfs_r5l_recover_ctx *ctx) { struct btrfs_r5l_meta_block *mb; - btrfs_r5l_sync_page_io(log, log->dev, (ctx->pos >> 9), PAGE_SIZE, ctx->meta_page, REQ_OP_READ); + ret = btrfs_r5l_recover_read_page(ctx, ctx->meta_page, ctx->pos); + if (ret) + return ret; mb = kmap(ctx->meta_page); #ifdef BTRFS_DEBUG_R5LOG -- 2.9.4
[PATCH 06/14] Btrfs: raid56: add reclaim support
The log space is limited, so reclaim is necessary when there is not enough space to use. By recording the largest position we've written to the log disk and flushing all disks' cache and the superblock, we can be sure that data and parity before this position have the identical copy in the log and raid5/6 array. Also we need to take care of the case when IOs get reordered. A list is used to keep the order right. Signed-off-by: Liu Bo--- fs/btrfs/ctree.h | 10 +++- fs/btrfs/raid56.c | 63 -- fs/btrfs/transaction.c | 2 ++ 3 files changed, 72 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index d967627..9235643 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -244,8 +244,10 @@ struct btrfs_super_block { __le64 cache_generation; __le64 uuid_tree_generation; + /* r5log journal tail (where recovery starts) */ + __le64 journal_tail; /* future expansion */ - __le64 reserved[30]; + __le64 reserved[29]; u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE]; struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS]; } __attribute__ ((__packed__)); @@ -2291,6 +2293,8 @@ BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, log_root_transid, 64); BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, log_root_level, 8); +BTRFS_SETGET_STACK_FUNCS(super_journal_tail, struct btrfs_super_block, +journal_tail, 64); BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, total_bytes, 64); BTRFS_SETGET_STACK_FUNCS(super_bytes_used, struct btrfs_super_block, @@ -3284,6 +3288,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, unsigned long new_flags); int btrfs_sync_fs(struct super_block *sb, int wait); +/* raid56.c */ +void btrfs_r5l_write_journal_tail(struct btrfs_fs_info *fs_info); + + static inline __printf(2, 3) void btrfs_no_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...) 
{ diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 007ba63..60010a6 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -191,6 +191,8 @@ struct btrfs_r5l_log { u64 data_offset; u64 device_size; + u64 next_checkpoint; + u64 last_checkpoint; u64 last_cp_seq; u64 seq; @@ -1231,11 +1233,14 @@ static void btrfs_r5l_log_endio(struct bio *bio) bio_put(bio); #ifdef BTRFS_DEBUG_R5LOG - trace_printk("move data to disk\n"); + trace_printk("move data to disk(current log->next_checkpoint %llu (will be %llu after writing to RAID\n", log->next_checkpoint, io->log_start); #endif /* move data to RAID. */ btrfs_write_rbio(io->rbio); + /* After stripe data has been flushed into raid, set ->next_checkpoint. */ + log->next_checkpoint = io->log_start; + if (log->current_io == io) log->current_io = NULL; btrfs_r5l_free_io_unit(log, io); @@ -1473,6 +1478,42 @@ static bool btrfs_r5l_has_free_space(struct btrfs_r5l_log *log, u64 size) } /* + * writing super with log->next_checkpoint + * + * This is protected by log->io_mutex. + */ +static void btrfs_r5l_write_super(struct btrfs_fs_info *fs_info, u64 cp) +{ + int ret; + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("r5l writing super to reclaim space, cp %llu\n", cp); +#endif + + btrfs_set_super_journal_tail(fs_info->super_for_commit, cp); + + /* +* flush all disk cache so that all data prior to +* %next_checkpoint lands on raid disks(recovery will start +* from %next_checkpoint). +*/ + ret = write_all_supers(fs_info, 1); + ASSERT(ret == 0); +} + +/* this is called by commit transaction and it's followed by writing super. */ +void btrfs_r5l_write_journal_tail(struct btrfs_fs_info *fs_info) +{ + if (fs_info->r5log) { + u64 cp = READ_ONCE(fs_info->r5log->next_checkpoint); + + trace_printk("journal_tail %llu\n", cp); + btrfs_set_super_journal_tail(fs_info->super_copy, cp); + WRITE_ONCE(fs_info->r5log->last_checkpoint, cp); + } +} + +/* * return 0 if data/parity are written into log and it will move data * to RAID in endio. 
* @@ -1535,7 +1576,25 @@ static int btrfs_r5l_write_stripe(struct btrfs_raid_bio *rbio) btrfs_r5l_log_stripe(log, data_pages, parity_pages, rbio); do_submit = true; } else { - ; /* XXX: reclaim */ +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("r5log: no space log->last_checkpoint %llu log->log_start %llu log->next_checkpoint %llu\n", log->last_checkpoint, log->log_start, log->next_checkpoint); +#endif + + /* +*
[PATCH 01/14] Btrfs: raid56: add raid56 log via add_dev v2 ioctl
This introduces add_dev_v2 ioctl to add a device as raid56 journal device. With the help of a journal device, raid56 is able to get rid of potential write holes. Signed-off-by: Liu Bo--- fs/btrfs/ctree.h| 6 ++ fs/btrfs/ioctl.c| 48 - fs/btrfs/raid56.c | 42 fs/btrfs/raid56.h | 1 + fs/btrfs/volumes.c | 26 -- fs/btrfs/volumes.h | 3 ++- include/uapi/linux/btrfs.h | 3 +++ include/uapi/linux/btrfs_tree.h | 4 8 files changed, 125 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 643c70d..d967627 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -697,6 +697,7 @@ struct btrfs_stripe_hash_table { void btrfs_init_async_reclaim_work(struct work_struct *work); /* fs_info */ +struct btrfs_r5l_log; struct reloc_control; struct btrfs_device; struct btrfs_fs_devices; @@ -1114,6 +1115,9 @@ struct btrfs_fs_info { u32 nodesize; u32 sectorsize; u32 stripesize; + + /* raid56 log */ + struct btrfs_r5l_log *r5log; }; static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb) @@ -2932,6 +2936,8 @@ static inline int btrfs_need_cleaner_sleep(struct btrfs_fs_info *fs_info) static inline void free_fs_info(struct btrfs_fs_info *fs_info) { + if (fs_info->r5log) + kfree(fs_info->r5log); kfree(fs_info->balance_ctl); kfree(fs_info->delayed_root); kfree(fs_info->extent_root); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index e176375..3d1ef4d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2653,6 +2653,50 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp) return ret; } +/* identical to btrfs_ioctl_add_dev, but this is with flags */ +static long btrfs_ioctl_add_dev_v2(struct btrfs_fs_info *fs_info, void __user *arg) +{ + struct btrfs_ioctl_vol_args_v2 *vol_args; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) + return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS; + + mutex_lock(&fs_info->volume_mutex); + vol_args = memdup_user(arg, sizeof(*vol_args)); + if
(IS_ERR(vol_args)) { + ret = PTR_ERR(vol_args); + goto out; + } + + if (vol_args->flags & BTRFS_DEVICE_RAID56_LOG && + fs_info->r5log) { + ret = -EEXIST; + btrfs_info(fs_info, "r5log: attempting to add another log device!"); + goto out_free; + } + + vol_args->name[BTRFS_PATH_NAME_MAX] = '\0'; + ret = btrfs_init_new_device(fs_info, vol_args->name, vol_args->flags); + if (!ret) { + if (vol_args->flags & BTRFS_DEVICE_RAID56_LOG) { + ASSERT(fs_info->r5log); + btrfs_info(fs_info, "disk added %s as raid56 log", vol_args->name); + } else { + btrfs_info(fs_info, "disk added %s", vol_args->name); + } + } +out_free: + kfree(vol_args); +out: + mutex_unlock(&fs_info->volume_mutex); + clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags); + return ret; +} + static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg) { struct btrfs_ioctl_vol_args *vol_args; @@ -2672,7 +2716,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg) } vol_args->name[BTRFS_PATH_NAME_MAX] = '\0'; - ret = btrfs_init_new_device(fs_info, vol_args->name); + ret = btrfs_init_new_device(fs_info, vol_args->name, 0); if (!ret) btrfs_info(fs_info, "disk added %s", vol_args->name); @@ -5539,6 +5583,8 @@ long btrfs_ioctl(struct file *file, unsigned int return btrfs_ioctl_resize(file, argp); case BTRFS_IOC_ADD_DEV: return btrfs_ioctl_add_dev(fs_info, argp); + case BTRFS_IOC_ADD_DEV_V2: + return btrfs_ioctl_add_dev_v2(fs_info, argp); case BTRFS_IOC_RM_DEV: return btrfs_ioctl_rm_dev(file, argp); case BTRFS_IOC_RM_DEV_V2: diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index d8ea0eb..2b91b95 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -177,6 +177,25 @@ struct btrfs_raid_bio { unsigned long *dbitmap; }; +/* raid56 log */ +struct btrfs_r5l_log { + /* protect this struct and log io */ + struct mutex io_mutex; + + /* r5log device */ + struct btrfs_device *dev; + + /* allocation range for log entries */ + u64 data_offset; + u64 device_size; + + u64 last_checkpoint; +
u64 last_cp_seq; + u64 seq; + u64 log_start; + struct btrfs_r5l_io_unit *current_io;
[PATCH 09/14] Btrfs: raid56: add readahead for recovery
While doing recovery, blocks are read from the raid5/6 disk one by one, so this is adding readahead so that we can read at most 256 contiguous blocks in one read IO. Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 114 +++--- 1 file changed, 109 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index dea33c4..24f7cbb 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1530,15 +1530,81 @@ static int btrfs_r5l_write_empty_meta_block(struct btrfs_r5l_log *log, u64 pos, return ret; } +#define BTRFS_R5L_RECOVER_IO_POOL_SIZE BIO_MAX_PAGES struct btrfs_r5l_recover_ctx { u64 pos; u64 seq; u64 total_size; struct page *meta_page; struct page *io_page; + + struct page *ra_pages[BTRFS_R5L_RECOVER_IO_POOL_SIZE]; + struct bio *ra_bio; + int total; + int valid; + u64 start_offset; + + struct btrfs_r5l_log *log; }; -static int btrfs_r5l_recover_load_meta(struct btrfs_r5l_log *log, struct btrfs_r5l_recover_ctx *ctx) +static int btrfs_r5l_recover_read_ra(struct btrfs_r5l_recover_ctx *ctx, u64 offset) +{ + bio_reset(ctx->ra_bio); + ctx->ra_bio->bi_bdev = ctx->log->dev->bdev; + ctx->ra_bio->bi_opf = REQ_OP_READ; + ctx->ra_bio->bi_iter.bi_sector = (ctx->log->data_offset + offset) >> 9; + + ctx->valid = 0; + ctx->start_offset = offset; + + while (ctx->valid < ctx->total) { + bio_add_page(ctx->ra_bio, ctx->ra_pages[ctx->valid++], PAGE_SIZE, 0); + + offset = btrfs_r5l_ring_add(ctx->log, offset, PAGE_SIZE); + if (offset == 0) + break; + } + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("to read %d pages starting from 0x%llx\n", ctx->valid, ctx->log->data_offset + ctx->start_offset); +#endif + return submit_bio_wait(ctx->ra_bio); +} + +static int btrfs_r5l_recover_read_page(struct btrfs_r5l_recover_ctx *ctx, struct page *page, u64 offset) +{ + struct page *tmp; + int index; + char *src; + char *dst; + int ret; + + if (offset < ctx->start_offset || offset >= (ctx->start_offset + ctx->valid * PAGE_SIZE)) { + ret = btrfs_r5l_recover_read_ra(ctx, offset); + if (ret) 
+ return ret; + } + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("offset 0x%llx start->offset 0x%llx ctx->valid %d\n", offset, ctx->start_offset, ctx->valid); +#endif + + ASSERT(IS_ALIGNED(ctx->start_offset, PAGE_SIZE)); + ASSERT(IS_ALIGNED(offset, PAGE_SIZE)); + + index = (offset - ctx->start_offset) >> PAGE_SHIFT; + ASSERT(index < ctx->valid); + + tmp = ctx->ra_pages[index]; + src = kmap(tmp); + dst = kmap(page); + memcpy(dst, src, PAGE_SIZE); + kunmap(page); + kunmap(tmp); + return 0; +} + +static int btrfs_r5l_recover_load_meta(struct btrfs_r5l_recover_ctx *ctx) { struct btrfs_r5l_meta_block *mb; @@ -1642,6 +1708,42 @@ static int btrfs_r5l_recover_flush_log(struct btrfs_r5l_log *log, struct btrfs_r } return ret; + +static int btrfs_r5l_recover_allocate_ra(struct btrfs_r5l_recover_ctx *ctx) +{ + struct page *page; + ctx->ra_bio = btrfs_io_bio_alloc(GFP_NOFS, BIO_MAX_PAGES); + + ctx->total = 0; + ctx->valid = 0; + while (ctx->total < BTRFS_R5L_RECOVER_IO_POOL_SIZE) { + page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); + if (!page) + break; + + ctx->ra_pages[ctx->total++] = page; + } + + if (ctx->total == 0) { + bio_put(ctx->ra_bio); + return -ENOMEM; + } + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("readahead: %d allocated pages\n", ctx->total); +#endif + return 0; +} + +static void btrfs_r5l_recover_free_ra(struct btrfs_r5l_recover_ctx *ctx) +{ + int i; +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("readahead: %d to free pages\n", ctx->total); +#endif + for (i = 0; i < ctx->total; i++) + __free_page(ctx->ra_pages[i]); + bio_put(ctx->ra_bio); } static void btrfs_r5l_write_super(struct btrfs_fs_info *fs_info, u64 cp); @@ -1655,6 +1757,7 @@ static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log) ctx = kzalloc(sizeof(*ctx), GFP_NOFS); ASSERT(ctx); + ctx->log = log; ctx->pos = log->last_checkpoint; ctx->seq = log->last_cp_seq; ctx->meta_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); @@ -1662,10 +1765,10 @@ static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log) ctx->io_page = 
alloc_page(GFP_NOFS | __GFP_HIGHMEM); ASSERT(ctx->io_page); - ret = btrfs_r5l_recover_flush_log(log, ctx); - if (ret) { - ; - } + ret = btrfs_r5l_recover_allocate_ra(ctx); + ASSERT(ret == 0); + +
[PATCH 14/14] Btrfs: raid56: maintain IO order on raid5/6 log
A typical write to the raid5/6 log needs three steps: 1) collect data/parity pages into the bio in io_unit; 2) submit the bio in io_unit; 3) writeback data/parity to raid array in end_io. 1) and 2) are protected within log->io_mutex, while 3) is not. Since recovery needs to know the checkpoint offset where the highest successful writeback is, we cannot allow IO to be reordered. This is adding a list in which IO order is maintained properly. Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 42 ++ fs/btrfs/raid56.h | 5 + 2 files changed, 39 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index b771d7d..ceca415 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -183,6 +183,9 @@ struct btrfs_r5l_log { /* protect this struct and log io */ struct mutex io_mutex; + spinlock_t io_list_lock; + struct list_head io_list; + /* r5log device */ struct btrfs_device *dev; @@ -1205,6 +1208,7 @@ static struct btrfs_r5l_io_unit *btrfs_r5l_alloc_io_unit(struct btrfs_r5l_log *l static void btrfs_r5l_free_io_unit(struct btrfs_r5l_log *log, struct btrfs_r5l_io_unit *io) { __free_page(io->meta_page); + ASSERT(list_empty(&io->list)); kfree(io); } @@ -1225,6 +1229,27 @@ static void btrfs_r5l_reserve_log_entry(struct btrfs_r5l_log *log, struct btrfs_ io->need_split_bio = true; } +/* the IO order is maintained in log->io_list.
*/ +static void btrfs_r5l_finish_io(struct btrfs_r5l_log *log) +{ + struct btrfs_r5l_io_unit *io, *next; + + spin_lock(&log->io_list_lock); + list_for_each_entry_safe(io, next, &log->io_list, list) { + if (io->status != BTRFS_R5L_STRIPE_END) + break; + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("current log->next_checkpoint %llu (will be %llu after writing to RAID\n", log->next_checkpoint, io->log_start); +#endif + + list_del_init(&io->list); + log->next_checkpoint = io->log_start; + btrfs_r5l_free_io_unit(log, io); + } + spin_unlock(&log->io_list_lock); +} + static void btrfs_write_rbio(struct btrfs_raid_bio *rbio); static void btrfs_r5l_log_endio(struct bio *bio) @@ -1234,18 +1259,12 @@ static void btrfs_r5l_log_endio(struct bio *bio) bio_put(bio); -#ifdef BTRFS_DEBUG_R5LOG - trace_printk("move data to disk(current log->next_checkpoint %llu (will be %llu after writing to RAID\n", log->next_checkpoint, io->log_start); -#endif /* move data to RAID. */ btrfs_write_rbio(io->rbio); + io->status = BTRFS_R5L_STRIPE_END; /* After stripe data has been flushed into raid, set ->next_checkpoint.
*/ - log->next_checkpoint = io->log_start; - - if (log->current_io == io) - log->current_io = NULL; - btrfs_r5l_free_io_unit(log, io); + btrfs_r5l_finish_io(log); } static struct bio *btrfs_r5l_bio_alloc(struct btrfs_r5l_log *log) @@ -1299,6 +1318,11 @@ static struct btrfs_r5l_io_unit *btrfs_r5l_new_meta(struct btrfs_r5l_log *log) bio_add_page(io->current_bio, io->meta_page, PAGE_SIZE, 0); btrfs_r5l_reserve_log_entry(log, io); + + INIT_LIST_HEAD(&io->list); + spin_lock(&log->io_list_lock); + list_add_tail(&io->list, &log->io_list); + spin_unlock(&log->io_list_lock); return io; } @@ -3760,6 +3784,8 @@ struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info *fs_info, ASSERT(sizeof(device->uuid) == BTRFS_UUID_SIZE); log->uuid_csum = btrfs_crc32c(~0, device->uuid, sizeof(device->uuid)); mutex_init(&log->io_mutex); + spin_lock_init(&log->io_list_lock); + INIT_LIST_HEAD(&log->io_list); return log; } diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index 2cc64a3..fc4ff20 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -43,11 +43,16 @@ static inline int nr_data_stripes(struct map_lookup *map) struct btrfs_r5l_log; #define BTRFS_R5LOG_MAGIC 0x6433c509 +#define BTRFS_R5L_STRIPE_END 1 + /* one meta block + several data + parity blocks */ struct btrfs_r5l_io_unit { struct btrfs_r5l_log *log; struct btrfs_raid_bio *rbio; + struct list_head list; + int status; + /* store meta block */ struct page *meta_page; -- 2.9.4
[PATCH 1/2] Btrfs-progs: add option to add raid5/6 log device
This introduces an option for 'btrfs device add' to add a device as raid5/6 log at run time. Signed-off-by: Liu Bo--- cmds-device.c | 30 +- ioctl.h | 3 +++ 2 files changed, 28 insertions(+), 5 deletions(-) diff --git a/cmds-device.c b/cmds-device.c index 4337eb2..ec6037e 100644 --- a/cmds-device.c +++ b/cmds-device.c @@ -45,6 +45,7 @@ static const char * const cmd_device_add_usage[] = { "Add a device to a filesystem", "-K|--nodiscard    do not perform whole device TRIM", "-f|--force        force overwrite existing filesystem on the disk", + "-L|--r5log        add a disk as raid56 log", NULL }; @@ -55,6 +56,7 @@ static int cmd_device_add(int argc, char **argv) DIR *dirstream = NULL; int discard = 1; int force = 0; + int for_r5log = 0; int last_dev; while (1) { @@ -62,10 +64,11 @@ static int cmd_device_add(int argc, char **argv) static const struct option long_options[] = { { "nodiscard", optional_argument, NULL, 'K'}, { "force", no_argument, NULL, 'f'}, + { "r5log", no_argument, NULL, 'L'}, { NULL, 0, NULL, 0} }; - c = getopt_long(argc, argv, "Kf", long_options, NULL); + c = getopt_long(argc, argv, "KfL", long_options, NULL); if (c < 0) break; switch (c) { @@ -75,6 +78,9 @@ static int cmd_device_add(int argc, char **argv) case 'f': force = 1; break; + case 'L': + for_r5log = 1; + break; default: usage(cmd_device_add_usage); } @@ -83,6 +89,9 @@ static int cmd_device_add(int argc, char **argv) if (check_argc_min(argc - optind, 2)) usage(cmd_device_add_usage); + if (for_r5log && check_argc_max(argc - optind, 2)) + usage(cmd_device_add_usage); + last_dev = argc - 1; mntpnt = argv[last_dev]; @@ -91,7 +100,6 @@ static int cmd_device_add(int argc, char **argv) return 1; for (i = optind; i < last_dev; i++){ - struct btrfs_ioctl_vol_args ioctl_args; int devfd, res; u64 dev_block_count = 0; char *path; @@ -126,9 +134,21 @@ static int cmd_device_add(int argc, char **argv) goto error_out; } - memset(&ioctl_args, 0, sizeof(ioctl_args)); - strncpy_null(ioctl_args.name, path); - res = ioctl(fdmnt, 
BTRFS_IOC_ADD_DEV, &ioctl_args); + if (!for_r5log) { + struct btrfs_ioctl_vol_args ioctl_args; + + memset(&ioctl_args, 0, sizeof(ioctl_args)); + strncpy_null(ioctl_args.name, path); + res = ioctl(fdmnt, BTRFS_IOC_ADD_DEV, &ioctl_args); + } else { + /* apply v2 args format */ + struct btrfs_ioctl_vol_args_v2 ioctl_args; + + memset(&ioctl_args, 0, sizeof(ioctl_args)); + strncpy_null(ioctl_args.name, path); + ioctl_args.flags |= BTRFS_DEVICE_RAID56_LOG; + res = ioctl(fdmnt, BTRFS_IOC_ADD_DEV_V2, &ioctl_args); + } if (res < 0) { error("error adding device '%s': %s", path, strerror(errno)); diff --git a/ioctl.h b/ioctl.h index 709e996..748a7af 100644 --- a/ioctl.h +++ b/ioctl.h @@ -53,6 +53,7 @@ BUILD_ASSERT(sizeof(struct btrfs_ioctl_vol_args) == 4096); #define BTRFS_SUBVOL_RDONLY (1ULL << 1) #define BTRFS_SUBVOL_QGROUP_INHERIT (1ULL << 2) #define BTRFS_DEVICE_SPEC_BY_ID (1ULL << 3) +#define BTRFS_DEVICE_RAID56_LOG (1ULL << 4) #define BTRFS_VOL_ARG_V2_FLAGS_SUPPORTED \ (BTRFS_SUBVOL_CREATE_ASYNC |\ @@ -828,6 +829,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code) struct btrfs_ioctl_feature_flags[3]) #define BTRFS_IOC_RM_DEV_V2 _IOW(BTRFS_IOCTL_MAGIC, 58, \ struct btrfs_ioctl_vol_args_v2) +#define BTRFS_IOC_ADD_DEV_V2 _IOW(BTRFS_IOCTL_MAGIC, 59, \ + struct btrfs_ioctl_vol_args_v2) #ifdef __cplusplus } #endif -- 2.5.0
[PATCH 07/14] Btrfs: raid56: load r5log
A raid5/6 log can be loaded either while mounting a btrfs filesystem which already has a disk set up as the raid5/6 log, or when setting up a disk as the raid5/6 log for the first time. It gets %journal_tail from the super_block, reads the first 4K block at that position and runs sanity checks on it; if the block is valid, it then checks whether anything needs to be replayed, otherwise it creates a new empty block at the beginning of the disk and new writes will append to it. Signed-off-by: Liu Bo--- fs/btrfs/disk-io.c | 16 +++ fs/btrfs/raid56.c | 128 + fs/btrfs/raid56.h | 1 + 3 files changed, 145 insertions(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 8685d67..c2d8697 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2987,6 +2987,22 @@ int open_ctree(struct super_block *sb, fs_info->generation = generation; fs_info->last_trans_committed = generation; + if (fs_info->r5log) { + u64 cp = btrfs_super_journal_tail(fs_info->super_copy); +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("%s: get journal_tail %llu\n", __func__, cp); +#endif + /* if the data is not replayed, data and parity on +* disk are still consistent. So we can move on. +* +* About fsync, since fsync can make sure data is +* flushed onto disk and only metadata is kept into +* write-ahead log, the fsync'd data will never end +* up being replayed by the raid56 log. 
+*/ + btrfs_r5l_load_log(fs_info, cp); + } + ret = btrfs_recover_balance(fs_info); if (ret) { btrfs_err(fs_info, "failed to recover balance: %d", ret); diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 60010a6..5d7ea235 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1477,6 +1477,134 @@ static bool btrfs_r5l_has_free_space(struct btrfs_r5l_log *log, u64 size) return log->device_size > (used_size + size); } +static int btrfs_r5l_sync_page_io(struct btrfs_r5l_log *log, + struct btrfs_device *dev, sector_t sector, + int size, struct page *page, int op) +{ + struct bio *bio = btrfs_io_bio_alloc(GFP_NOFS, 1); + int ret; + + bio->bi_bdev = dev->bdev; + bio->bi_opf = op; + if (dev == log->dev) + bio->bi_iter.bi_sector = (log->data_offset >> 9) + sector; + else + bio->bi_iter.bi_sector = sector; + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("%s: op %d bi_sector 0x%llx\n", __func__, op, (bio->bi_iter.bi_sector << 9)); +#endif + + bio_add_page(bio, page, size, 0); + submit_bio_wait(bio); + ret = !bio->bi_error; + bio_put(bio); + return ret; +} + +static int btrfs_r5l_write_empty_meta_block(struct btrfs_r5l_log *log, u64 pos, u64 seq) +{ + struct page *page; + struct btrfs_r5l_meta_block *mb; + int ret = 0; + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("%s: pos %llu seq %llu\n", __func__, pos, seq); +#endif + + page = alloc_page(GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO); + ASSERT(page); + + mb = kmap(page); + mb->magic = cpu_to_le32(BTRFS_R5LOG_MAGIC); + mb->meta_size = cpu_to_le32(sizeof(struct btrfs_r5l_meta_block)); + mb->seq = cpu_to_le64(seq); + mb->position = cpu_to_le64(pos); + kunmap(page); + + if (!btrfs_r5l_sync_page_io(log, log->dev, (pos >> 9), PAGE_SIZE, page, REQ_OP_WRITE | REQ_FUA)) { + ret = -EIO; + } + + __free_page(page); + return ret; +} + +static void btrfs_r5l_write_super(struct btrfs_fs_info *fs_info, u64 cp); + +static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log) +{ + return 0; +} + +/* return 0 if success, otherwise return errors */ 
+int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp) +{ + struct btrfs_r5l_log *log = fs_info->r5log; + struct page *page; + struct btrfs_r5l_meta_block *mb; + bool create_new = false; + + ASSERT(log); + + page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); + ASSERT(page); + + if (!btrfs_r5l_sync_page_io(log, log->dev, (cp >> 9), PAGE_SIZE, page, + REQ_OP_READ)) { + __free_page(page); + return -EIO; + } + + mb = kmap(page); +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("r5l: mb->pos %llu cp %llu mb->seq %llu\n", le64_to_cpu(mb->position), cp, le64_to_cpu(mb->seq)); +#endif + + if (le32_to_cpu(mb->magic) != BTRFS_R5LOG_MAGIC) { +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("magic not match: create new r5l\n"); +#endif + create_new = true; + goto create; + } + + ASSERT(le64_to_cpu(mb->position) == cp); + if (le64_to_cpu(mb->position) != cp) { +#ifdef
[PATCH 05/14] Btrfs: raid56: add stripe log for raid5/6
This adds the ability to use a disk as the raid5/6 stripe log (aka journal); the primary goal is to fix the write hole issue that is inherent in a raid56 setup. In a typical raid5/6 setup, both a full stripe write and a partial stripe write generate parity at the very end of writing, so once parity is generated it's the right time to issue the writes. Now, with the raid5/6 stripe log, every write is put into the stripe log prior to being written to the raid5/6 array, so that we have everything needed to rewrite any 'not-yet-on-disk' data/parity if a power loss happens while writing data/parity to the different disks in the raid5/6 array. A metadata block is used to manage the information about data and parity, and it's placed ahead of the data and parity on the stripe log. Right now such a metadata block is limited to one page in size and the structure is defined as {metadata block} + {a few payloads} - 'metadata block' contains a magic code, a sequence number and the start position on the stripe log. - 'payload' contains the information about data and parity, e.g. the physical offset and device id where data/parity is supposed to be. Each data block has its own payload, while each set of parity has one payload (e.g. for raid6, parity p and q each have their own payload). We treat data and parity differently because btrfs always prepares the whole stripe length (64k) of parity, but data may come from only a partial stripe write. The metadata block is written to the raid5/6 stripe log together with data/parity in a single bio (it may be split into two bios; more than two is not supported). 
Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 512 +++--- fs/btrfs/raid56.h | 65 +++ 2 files changed, 513 insertions(+), 64 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index c75766f..007ba63 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -185,6 +185,8 @@ struct btrfs_r5l_log { /* r5log device */ struct btrfs_device *dev; + struct btrfs_fs_info *fs_info; + /* allocation range for log entries */ u64 data_offset; u64 device_size; @@ -1179,6 +1181,445 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio) spin_unlock_irq(&rbio->bio_list_lock); } +/* r5log */ +/* XXX: this allocation may be done earlier, eg. when allocating rbio */ +static struct btrfs_r5l_io_unit *btrfs_r5l_alloc_io_unit(struct btrfs_r5l_log *log) +{ + struct btrfs_r5l_io_unit *io; + gfp_t gfp = GFP_NOFS; + + io = kzalloc(sizeof(*io), gfp); + ASSERT(io); + io->log = log; + /* need to use kmap. */ + io->meta_page = alloc_page(gfp | __GFP_HIGHMEM | __GFP_ZERO); + ASSERT(io->meta_page); + + return io; +} + +static void btrfs_r5l_free_io_unit(struct btrfs_r5l_log *log, struct btrfs_r5l_io_unit *io) +{ + __free_page(io->meta_page); + kfree(io); +} + +static u64 btrfs_r5l_ring_add(struct btrfs_r5l_log *log, u64 start, u64 inc) +{ + start += inc; + if (start >= log->device_size) + start = start - log->device_size; + return start; +} + +static void btrfs_r5l_reserve_log_entry(struct btrfs_r5l_log *log, struct btrfs_r5l_io_unit *io) +{ + log->log_start = btrfs_r5l_ring_add(log, log->log_start, PAGE_SIZE); + io->log_end = log->log_start; + + if (log->log_start == 0) + io->need_split_bio = true; +} + +static void btrfs_write_rbio(struct btrfs_raid_bio *rbio); + +static void btrfs_r5l_log_endio(struct bio *bio) +{ + struct btrfs_r5l_io_unit *io = bio->bi_private; + struct btrfs_r5l_log *log = io->log; + + bio_put(bio); + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("move data to disk\n"); +#endif + /* move data to RAID. 
*/ + btrfs_write_rbio(io->rbio); + + if (log->current_io == io) + log->current_io = NULL; + btrfs_r5l_free_io_unit(log, io); +} + +static struct bio *btrfs_r5l_bio_alloc(struct btrfs_r5l_log *log) +{ + /* this allocation will not fail. */ + struct bio *bio = btrfs_io_bio_alloc(GFP_NOFS, BIO_MAX_PAGES); + + /* We need to make sure data/parity are settled down on the log disk. */ + bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA; + bio->bi_bdev = log->dev->bdev; + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("log->data_offset 0x%llx log->log_start 0x%llx\n", log->data_offset, log->log_start); +#endif + bio->bi_iter.bi_sector = (log->data_offset + log->log_start) >> 9; + + return bio; +} + +static struct btrfs_r5l_io_unit *btrfs_r5l_new_meta(struct btrfs_r5l_log *log) +{ + struct btrfs_r5l_io_unit *io; + struct btrfs_r5l_meta_block *block; + + io = btrfs_r5l_alloc_io_unit(log); + ASSERT(io); + + block = kmap(io->meta_page); + clear_page(block); + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("%s pos %llu seq %llu\n", __func__, log->log_start,
[PATCH 08/14] Btrfs: raid56: log recovery
This is adding recovery on raid5/6 log. We've set a %journal_tail in super_block, which indicates the position from where we need to replay data. So we scan the log and replay valid meta/data/parity pairs until finding an invalid one. By replaying, it simply reads data/parity from the raid5/6 log and issues writes to the raid disks where it should be. Please note that the whole meta/data/parity pair can be discarded if it fails the sanity check in the meta block. After recovery, we also append an empty meta block and update the %journal_tail in super_block in order to avoid a situation, where the layout on the raid5/6 log is [valid A][invalid B][valid C], so block A is the only one we should replay. Then the recovery ends up pointing to block A as block B is invalid, and some new writes come in and append to block A so that block B is now overwritten to be a valid meta/data/parity. If a power loss happens, the new recovery starts again from block A, and since block B is now valid, it may replay block C as well which has become stale. 
Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 151 ++ 1 file changed, 151 insertions(+) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 5d7ea235..dea33c4 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1530,10 +1530,161 @@ static int btrfs_r5l_write_empty_meta_block(struct btrfs_r5l_log *log, u64 pos, return ret; } +struct btrfs_r5l_recover_ctx { + u64 pos; + u64 seq; + u64 total_size; + struct page *meta_page; + struct page *io_page; +}; + +static int btrfs_r5l_recover_load_meta(struct btrfs_r5l_log *log, struct btrfs_r5l_recover_ctx *ctx) +{ + struct btrfs_r5l_meta_block *mb; + + btrfs_r5l_sync_page_io(log, log->dev, (ctx->pos >> 9), PAGE_SIZE, ctx->meta_page, REQ_OP_READ); + + mb = kmap(ctx->meta_page); +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("ctx->pos %llu ctx->seq %llu pos %llu seq %llu\n", ctx->pos, ctx->seq, le64_to_cpu(mb->position), le64_to_cpu(mb->seq)); +#endif + + if (le32_to_cpu(mb->magic) != BTRFS_R5LOG_MAGIC || + le64_to_cpu(mb->position) != ctx->pos || + le64_to_cpu(mb->seq) != ctx->seq) { +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("%s: mismatch magic %llu default %llu\n", __func__, le32_to_cpu(mb->magic), BTRFS_R5LOG_MAGIC); +#endif + return -EINVAL; + } + + ASSERT(le32_to_cpu(mb->meta_size) <= PAGE_SIZE); + kunmap(ctx->meta_page); + + /* meta_block */ + ctx->total_size = PAGE_SIZE; + + return 0; +} + +static int btrfs_r5l_recover_load_data(struct btrfs_r5l_log *log, struct btrfs_r5l_recover_ctx *ctx) +{ + u64 offset; + struct btrfs_r5l_meta_block *mb; + u64 meta_size; + u64 io_offset; + struct btrfs_device *dev; + + mb = kmap(ctx->meta_page); + + io_offset = PAGE_SIZE; + offset = sizeof(struct btrfs_r5l_meta_block); + meta_size = le32_to_cpu(mb->meta_size); + + while (offset < meta_size) { + struct btrfs_r5l_payload *payload = (void *)mb + offset; + + /* read data from log disk and write to payload->location */ +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("payload type %d flags %d size %d location 0x%llx devid %llu\n", 
le16_to_cpu(payload->type), le16_to_cpu(payload->flags), le32_to_cpu(payload->size), le64_to_cpu(payload->location), le64_to_cpu(payload->devid)); +#endif + + dev = btrfs_find_device(log->fs_info, le64_to_cpu(payload->devid), NULL, NULL); + if (!dev || dev->missing) { + ASSERT(0); + } + + if (le16_to_cpu(payload->type) == R5LOG_PAYLOAD_DATA) { + ASSERT(le32_to_cpu(payload->size) == 1); + btrfs_r5l_sync_page_io(log, log->dev, (ctx->pos + io_offset) >> 9, PAGE_SIZE, ctx->io_page, REQ_OP_READ); + btrfs_r5l_sync_page_io(log, dev, le64_to_cpu(payload->location) >> 9, PAGE_SIZE, ctx->io_page, REQ_OP_WRITE); + io_offset += PAGE_SIZE; + } else if (le16_to_cpu(payload->type) == R5LOG_PAYLOAD_PARITY) { + int i; + ASSERT(le32_to_cpu(payload->size) == 16); + for (i = 0; i < le32_to_cpu(payload->size); i++) { + /* liubo: parity are guaranteed to be +* contiguous, use just one bio to +* hold all pages and flush them. */ + u64 parity_off = le64_to_cpu(payload->location) + i * PAGE_SIZE; + btrfs_r5l_sync_page_io(log, log->dev, (ctx->pos + io_offset) >> 9, PAGE_SIZE, ctx->io_page, REQ_OP_READ); + btrfs_r5l_sync_page_io(log, dev, parity_off >> 9, PAGE_SIZE, ctx->io_page,
[PATCH 04/14] Btrfs: raid56: add verbose debug
Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 2 ++ fs/btrfs/volumes.c | 7 ++- fs/btrfs/volumes.h | 4 ++++ 3 files changed, 12 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 2b91b95..c75766f 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -2753,7 +2753,9 @@ int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device) cmpxchg(&fs_info->r5log, NULL, log); ASSERT(fs_info->r5log == log); +#ifdef BTRFS_DEBUG_R5LOG trace_printk("r5log: set a r5log in fs_info, alloc_range 0x%llx 0x%llx", log->data_offset, log->data_offset + log->device_size); +#endif return 0; } diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index a17a488..ac64d93 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4731,8 +4731,13 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, if (!device->in_fs_metadata || device->is_tgtdev_for_dev_replace || - (device->type & BTRFS_DEV_RAID56_LOG)) + (device->type & BTRFS_DEV_RAID56_LOG)) { +#ifdef BTRFS_DEBUG_R5LOG + if (device->type & BTRFS_DEV_RAID56_LOG) + btrfs_info(info, "skip a r5log when alloc chunk\n"); +#endif continue; + } if (device->total_bytes > device->bytes_used) total_avail = device->total_bytes - device->bytes_used; diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 60e347a..44cc3fa 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -26,6 +26,10 @@ extern struct mutex uuid_mutex; +#ifdef CONFIG_BTRFS_DEBUG +#define BTRFS_DEBUG_R5LOG +#endif + #define BTRFS_STRIPE_LEN SZ_64K struct buffer_head; -- 2.9.4
[PATCH 2/2] Btrfs-progs: introduce super_journal_tail to inspect-dump-super
We've recorded the journal_tail of the raid5/6 log in the super_block so that raid5/6 log recovery can scan from this position. This teaches inspect-dump-super to print %journal_tail. Signed-off-by: Liu Bo--- cmds-inspect-dump-super.c | 2 ++ ctree.h | 6 +- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c index 98e0270..baa4d1a 100644 --- a/cmds-inspect-dump-super.c +++ b/cmds-inspect-dump-super.c @@ -389,6 +389,8 @@ static void dump_superblock(struct btrfs_super_block *sb, int full) (unsigned long long)btrfs_super_log_root_transid(sb)); printf("log_root_level\t\t%llu\n", (unsigned long long)btrfs_super_log_root_level(sb)); + printf("journal_tail\t\t%llu\n", + (unsigned long long)btrfs_super_journal_tail(sb)); printf("total_bytes\t\t%llu\n", (unsigned long long)btrfs_super_total_bytes(sb)); printf("bytes_used\t\t%llu\n", diff --git a/ctree.h b/ctree.h index 48ae890..d28d6f7 100644 --- a/ctree.h +++ b/ctree.h @@ -458,8 +458,10 @@ struct btrfs_super_block { __le64 cache_generation; __le64 uuid_tree_generation; + __le64 journal_tail; + /* future expansion */ - __le64 reserved[30]; + __le64 reserved[29]; u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE]; struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS]; } __attribute__ ((__packed__)); @@ -2143,6 +2145,8 @@ BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, log_root_transid, 64); BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, log_root_level, 8); +BTRFS_SETGET_STACK_FUNCS(super_journal_tail, struct btrfs_super_block, +journal_tail, 64); BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, total_bytes, 64); BTRFS_SETGET_STACK_FUNCS(super_bytes_used, struct btrfs_super_block, -- 2.5.0
[PATCH 13/14] Btrfs: raid56: initialize raid5/6 log after adding it
We need to initialize the raid5/6 log after adding it, but we don't want to race with concurrent writes. So we initialize it before assigning the log pointer in %fs_info. Signed-off-by: Liu Bo--- fs/btrfs/disk-io.c | 2 +- fs/btrfs/raid56.c | 18 -- fs/btrfs/raid56.h | 3 ++- fs/btrfs/volumes.c | 2 ++ 4 files changed, 21 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index c2d8697..3fbd347 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3000,7 +3000,7 @@ int open_ctree(struct super_block *sb, * write-ahead log, the fsync'd data will never ends * up with being replayed by raid56 log. */ - btrfs_r5l_load_log(fs_info, cp); + btrfs_r5l_load_log(fs_info, NULL, cp); } ret = btrfs_recover_balance(fs_info); diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 0bfc97a..b771d7d 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1943,14 +1943,28 @@ static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log) } /* return 0 if success, otherwise return errors */ -int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp) +int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, struct btrfs_r5l_log *r5log, u64 cp) { - struct btrfs_r5l_log *log = fs_info->r5log; + struct btrfs_r5l_log *log; struct page *page; struct btrfs_r5l_meta_block *mb; bool create_new = false; int ret; + if (r5log) + ASSERT(fs_info->r5log == NULL); + if (fs_info->r5log) + ASSERT(r5log == NULL); + + if (fs_info->r5log) + log = fs_info->r5log; + else + /* +* this only happens when adding the raid56 log for +* the first time. 
+*/ + log = r5log; + ASSERT(log); page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index f6d6f36..2cc64a3 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -140,5 +140,6 @@ struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info *fs_info, void btrfs_r5l_init_log_post(struct btrfs_fs_info *fs_info, struct btrfs_r5l_log *log); int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device); -int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp); +int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, + struct btrfs_r5l_log *r5log, u64 cp); #endif diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 851c001..7f848d7 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2521,6 +2521,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path } if (is_r5log) { + /* initialize r5log with cp == 0. */ + btrfs_r5l_load_log(fs_info, r5log, 0); btrfs_r5l_init_log_post(fs_info, r5log); } -- 2.9.4
[PATCH 03/14] Btrfs: raid56: detect raid56 log on mount
We've put the flag BTRFS_DEV_RAID56_LOG in device->type, so we can recognize the journal device of raid56 while reading the chunk tree. Signed-off-by: Liu Bo--- fs/btrfs/volumes.c | 12 1 file changed, 12 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 5c50df7..a17a488 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6696,6 +6696,18 @@ static int read_one_dev(struct btrfs_fs_info *fs_info, } fill_device_from_item(leaf, dev_item, device); + + if (device->type & BTRFS_DEV_RAID56_LOG) { + ret = btrfs_set_r5log(fs_info, device); + if (ret) { + btrfs_err(fs_info, "error %d on loading r5log", ret); + return ret; + } + + btrfs_info(fs_info, "devid %llu uuid %pU is raid56 log", + device->devid, device->uuid); + } + device->in_fs_metadata = 1; if (device->writeable && !device->is_tgtdev_for_dev_replace) { device->fs_devices->total_rw_bytes += device->total_bytes; -- 2.9.4
[PATCH 02/14] Btrfs: raid56: do not allocate chunk on raid56 log
The journal device (aka raid56 log) is not for chunk allocation, so let's skip it. Signed-off-by: Liu Bo--- fs/btrfs/volumes.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index dafc541..5c50df7 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4730,7 +4730,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, } if (!device->in_fs_metadata || - device->is_tgtdev_for_dev_replace) + device->is_tgtdev_for_dev_replace || + (device->type & BTRFS_DEV_RAID56_LOG)) continue; if (device->total_bytes > device->bytes_used) -- 2.9.4
[PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
This aims to fix the write hole issue on a btrfs raid5/6 setup by adding a separate disk as a journal (aka the raid5/6 log), so that after an unclean shutdown we can make sure data and parity are consistent on the raid array by replaying the journal. The idea and the code are similar to the write-through mode of the md raid5-cache, so PPL (partial parity log) is also feasible to implement. (If you're familiar with md, you may find this patch set boring to read...) Patches 1-3 are about adding a log disk, patches 5-8 are the main part of the implementation, and the remaining patches are improvements and bugfixes, e.g. readahead for recovery and checksums. Two btrfs-progs patches are required to play with this patch set: one enhances 'btrfs device add' to add a disk as the raid5/6 log with the option '-L', the other teaches 'btrfs-show-super' to show %journal_tail. This is currently based on 4.12-rc3. The patch set is tagged with RFC, and comments are always welcome, thanks. Known limitations: - Deleting a log device is not implemented yet. 
Liu Bo (14): Btrfs: raid56: add raid56 log via add_dev v2 ioctl Btrfs: raid56: do not allocate chunk on raid56 log Btrfs: raid56: detect raid56 log on mount Btrfs: raid56: add verbose debug Btrfs: raid56: add stripe log for raid5/6 Btrfs: raid56: add reclaim support Btrfs: raid56: load r5log Btrfs: raid56: log recovery Btrfs: raid56: add readahead for recovery Btrfs: raid56: use the readahead helper to get page Btrfs: raid56: add csum support Btrfs: raid56: fix error handling while adding a log device Btrfs: raid56: initialize raid5/6 log after adding it Btrfs: raid56: maintain IO order on raid5/6 log fs/btrfs/ctree.h| 16 +- fs/btrfs/disk-io.c | 16 + fs/btrfs/ioctl.c| 48 +- fs/btrfs/raid56.c | 1429 ++- fs/btrfs/raid56.h | 82 +++ fs/btrfs/transaction.c |2 + fs/btrfs/volumes.c | 56 +- fs/btrfs/volumes.h |7 +- include/uapi/linux/btrfs.h |3 + include/uapi/linux/btrfs_tree.h |4 + 10 files changed, 1487 insertions(+), 176 deletions(-) -- 2.9.4
[PATCH 11/14] Btrfs: raid56: add csum support
This adds checksums for the meta/data/parity resident on the raid5/6 log, so recovery can now verify checksums to see whether anything inside meta/data/parity has been changed. If anything is wrong in a meta block, we stop replaying data/parity at that position; if anything is wrong in a data/parity block, we just skip that meta/data/parity pair and move on to the next one. Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 235 -- fs/btrfs/raid56.h | 4 + 2 files changed, 197 insertions(+), 42 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 8f47e56..8bc7ba4 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -43,6 +43,7 @@ #include "async-thread.h" #include "check-integrity.h" #include "rcu-string.h" +#include "hash.h" /* set when additional merges to this rbio are not allowed */ #define RBIO_RMW_LOCKED_BIT 1 @@ -197,6 +198,7 @@ struct btrfs_r5l_log { u64 last_cp_seq; u64 seq; u64 log_start; + u32 uuid_csum; struct btrfs_r5l_io_unit *current_io; }; @@ -1309,7 +1311,7 @@ static int btrfs_r5l_get_meta(struct btrfs_r5l_log *log, struct btrfs_raid_bio * return 0; } -static void btrfs_r5l_append_payload_meta(struct btrfs_r5l_log *log, u16 type, u64 location, u64 devid) +static void btrfs_r5l_append_payload_meta(struct btrfs_r5l_log *log, u16 type, u64 location, u64 devid, u32 csum) { struct btrfs_r5l_io_unit *io = log->current_io; struct btrfs_r5l_payload *payload; @@ -1326,11 +1328,11 @@ static void btrfs_r5l_append_payload_meta(struct btrfs_r5l_log *log, u16 type, u payload->size = cpu_to_le32(16); /* stripe_len / PAGE_SIZE */ payload->devid = cpu_to_le64(devid); payload->location = cpu_to_le64(location); + payload->csum = cpu_to_le32(csum); kunmap(io->meta_page); - /* XXX: add checksum later */ io->meta_offset += sizeof(*payload); - //io->meta_offset += sizeof(__le32); + #ifdef BTRFS_DEBUG_R5LOG trace_printk("io->meta_offset %d\n", io->meta_offset); #endif @@ -1380,6 +1382,10 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log *log, int 
data_pages, int int meta_size; int stripe, pagenr; struct page *page; + char *kaddr; + u32 csum; + u64 location; + u64 devid; /* * parity pages are contiguous on disk, thus only one @@ -1394,8 +1400,6 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log *log, int data_pages, int /* add data blocks which need to be written */ for (stripe = 0; stripe < rbio->nr_data; stripe++) { for (pagenr = 0; pagenr < rbio->stripe_npages; pagenr++) { - u64 location; - u64 devid; if (stripe < rbio->nr_data) { page = page_in_rbio(rbio, stripe, pagenr, 1); if (!page) @@ -1406,7 +1410,11 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log *log, int data_pages, int #ifdef BTRFS_DEBUG_R5LOG trace_printk("data: stripe %d pagenr %d location 0x%llx devid %llu\n", stripe, pagenr, location, devid); #endif - btrfs_r5l_append_payload_meta(log, R5LOG_PAYLOAD_DATA, location, devid); + kaddr = kmap(page); + csum = btrfs_crc32c(log->uuid_csum, kaddr, PAGE_SIZE); + kunmap(page); + + btrfs_r5l_append_payload_meta(log, R5LOG_PAYLOAD_DATA, location, devid, csum); btrfs_r5l_append_payload_page(log, page); } } @@ -1414,17 +1422,26 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log *log, int data_pages, int /* add the whole parity blocks */ for (; stripe < rbio->real_stripes; stripe++) { - u64 location = btrfs_compute_location(rbio, stripe, 0); - u64 devid = btrfs_compute_devid(rbio, stripe); + location = btrfs_compute_location(rbio, stripe, 0); + devid = btrfs_compute_devid(rbio, stripe); #ifdef BTRFS_DEBUG_R5LOG trace_printk("parity: stripe %d location 0x%llx devid %llu\n", stripe, location, devid); #endif - btrfs_r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY, location, devid); for (pagenr = 0; pagenr < rbio->stripe_npages; pagenr++) { page = rbio_stripe_page(rbio, stripe, pagenr); + + kaddr = kmap(page); + if (pagenr == 0) + csum = btrfs_crc32c(log->uuid_csum, kaddr, PAGE_SIZE); + else + csum = btrfs_crc32c(csum, kaddr, PAGE_SIZE); + kunmap(page);
Re: Massive loss of disk space
On 2017-08-01 12:50, pwm wrote: I did a temporary patch of the snapraid code to start fallocate() from the previous parity file size. Like I said though, it's BTRFS that's misbehaving here, not snapraid. I'm going to try to get some further discussion about this here on the mailing list, and hopefully it will get fixed in BTRFS (I would try to do so myself, but I'm at best a novice at C, and not well versed in kernel code). Finally have a snapraid sync up and running. Looks good, but will take quite a while before I can try a scrub command to double-check everything. Thanks for the help. Glad I could be helpful! /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 11:24, pwm wrote: Yes, the test code is as below - trying to match what snapraid tries to do: #include <stdio.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include <fcntl.h> #include <unistd.h> #include <sys/stat.h> int main() { int fd = open("/mnt/snap_04/snapraid.parity",O_NOFOLLOW|O_RDWR); if (fd < 0) { printf("Failed opening parity file [%s]\n",strerror(errno)); return 1; } off_t filesize = 5151751667712ull; int res; struct stat statbuf; if (fstat(fd, &statbuf)) { printf("Failed stat [%s]\n",strerror(errno)); close(fd); return 1; } printf("Original file size is %llu bytes\n", (unsigned long long)statbuf.st_size); printf("Trying to grow file to %llu bytes\n", (unsigned long long)filesize); res = fallocate(fd,0,0,filesize); if (res) { printf("Failed fallocate [%s]\n",strerror(errno)); close(fd); return 1; } if (fsync(fd)) { printf("Failed fsync [%s]\n",strerror(errno)); close(fd); return 1; } close(fd); return 0; } So the call doesn't make use of the previous file size as offset for the extension. int fallocate(int fd, int mode, off_t offset, off_t len); What you are implying here is that if the fallocate() call is modified to: res = fallocate(fd,0,old_size,new_size-old_size); then everything should work as expected? Based on what I've seen testing on my end, yes, that should cause things to work correctly. 
That said, given what snapraid does, the fact that they call fallocate covering the full desired size of the file is correct usage (the point is to make behavior deterministic, and calling it on the whole file makes sure that the file isn't sparse, which can impact performance). Given both the fact that calling fallocate() to extend the file without worrying about an offset is a legitimate use case, and that both ext4 and XFS (and I suspect almost every other Linux filesystem) works in this situation, I'd argue that the behavior of BTRFS in this situation is incorrect. /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 10:47, Austin S. Hemmelgarn wrote: On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. 
Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. I think I _might_
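The workaround pwm applied (fallocate() only from the old size) can be sketched as a small helper; the name and error handling are hypothetical, but the fallocate(fd, 0, old_size, new_size - old_size) call is exactly the one discussed in the thread:

```c
/* Sketch (not the snapraid code): extend a file by fallocate()ing only
 * the range past its current size, so the filesystem is never asked to
 * re-allocate already-written extents. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

static int extend_file(const char *path, off_t new_size)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st)) {
        close(fd);
        return -1;
    }

    int ret = 0;
    /* Only request the tail: offset = old size, len = growth. */
    if (new_size > st.st_size)
        ret = fallocate(fd, 0, st.st_size, new_size - st.st_size);

    if (fsync(fd))
        ret = -1;
    close(fd);
    return ret;
}
```

Because fallocate() without FALLOC_FL_KEEP_SIZE extends the file size to offset + len, this grows the file just like the whole-range call, but never re-covers the existing extents.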
Re: Massive loss of disk space
I did a temporary patch of the snapraid code to start fallocate() from the previous parity file size. Finally have a snapraid sync up and running. Looks good, but will take quite a while before I can try a scrub command to double-check everything. Thanks for the help. /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 11:24, pwm wrote: Yes, the test code is as below - trying to match what snapraid tries to do:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main() {
  int fd = open("/mnt/snap_04/snapraid.parity", O_NOFOLLOW|O_RDWR);
  if (fd < 0) {
    printf("Failed opening parity file [%s]\n", strerror(errno));
    return 1;
  }
  off_t filesize = 5151751667712ull;
  int res;
  struct stat statbuf;
  if (fstat(fd, &statbuf)) {
    printf("Failed stat [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  printf("Original file size is %llu bytes\n", (unsigned long long)statbuf.st_size);
  printf("Trying to grow file to %llu bytes\n", (unsigned long long)filesize);
  res = fallocate(fd, 0, 0, filesize);
  if (res) {
    printf("Failed fallocate [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  if (fsync(fd)) {
    printf("Failed fsync [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  close(fd);
  return 0;
}

So the call doesn't make use of the previous file size as offset for the extension. int fallocate(int fd, int mode, off_t offset, off_t len); What you are implying here is that if the fallocate() call is modified to: res = fallocate(fd,0,old_size,new_size-old_size); then everything should work as expected? Based on what I've seen testing on my end, yes, that should cause things to work correctly. That said, given what snapraid does, the fact that they call fallocate covering the full desired size of the file is correct usage (the point is to make behavior deterministic, and calling it on the whole file makes sure that the file isn't sparse, which can impact performance). 
Given both the fact that calling fallocate() to extend the file without worrying about an offset is a legitimate use case, and that both ext4 and XFS (and I suspect almost every other Linux filesystem) works in this situation, I'd argue that the behavior of BTRFS in this situation is incorrect. /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 10:47, Austin S. Hemmelgarn wrote: On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. 
root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. I think I _might_ understand what's going on here. Is that test program calling fallocate using the desired total size of the file, or just trying to allocate the range beyond the end to extend the file? I've seen issues with the first case on BTRFS before, and I'm starting to think that it might actually be trying to allocate the exact amount of space requested by fallocate,
Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)
2017-08-01 0:39 GMT+03:00 Ivan Sizov: > 2017-08-01 0:17 GMT+03:00 Marc MERLIN : >> On Tue, Aug 01, 2017 at 12:07:14AM +0300, Ivan Sizov wrote: >>> 2017-07-09 10:57 GMT+03:00 Martin Steigerwald : >>> > Hello Marc. >>> > >>> > Marc MERLIN - 08.07.17, 21:34: >>> >> Sigh, >>> >> >>> >> This is now the 3rd filesystem I have (on 3 different machines) that is >>> >> getting corruption of some kind (on 4.11.6). >>> > >>> > Anyone else getting corruptions with 4.11? >>> Yes, a lot. There are at least 3 cases, probably I've missed something. >>> https://www.spinics.net/lists/linux-btrfs/msg67177.html >>> https://www.spinics.net/lists/linux-btrfs/msg67681.html >>> https://unix.stackexchange.com/questions/369133/dealing-with-btrfs-ref-backpointer-mismatches-backref-missing/369275 >> >> Indeed. My main server is happy back on 4.9.36 and while my laptop is >> stuck on 4.11 due to other kernel issues that prevent me from going back >> to 4.9, it only corrupted a single filesystem so far, and no other ones >> that I've noticed yet. >> Hopefully that will hold :-/ >> >> Marc >> -- >> "A mouse is a device used to point at the xterm you want to type in" - A.S.R. >> Microsoft is to operating systems >> what McDonalds is to gourmet >> cooking >> Home page: http://marc.merlins.org/ | PGP >> 1024R/763BE901 > > I want to try mounting and checking FS under Live images with > different kernels tomorrow. Today's Fedora Rawhide image seems to be > built incorrectly. Can you advice me where to get a fresh live image > with 4.12 kernel (it's not important which distro that will be)? > > -- > Ivan Sizov Mounting problem persists: on 4.13.0 with btrfs-progs v4.11.1 (latest Fedora Rawhide Live) on 4.10.0 with btrfs-progs v4.9.1 (Ubuntu 17.04 Live) on 4.9.0 with btrfs-progs v 4.7.3 (Debian 9 Stretch Live) "btrfs check --readonly" also gives the same output on 4.11, 4.10 and 4.9. Marc, how did you roll back and fix those errors? 
-- Ivan Sizov -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Massive loss of disk space
On 2017-08-01 11:24, pwm wrote: Yes, the test code is as below - trying to match what snapraid tries to do:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main() {
  int fd = open("/mnt/snap_04/snapraid.parity", O_NOFOLLOW|O_RDWR);
  if (fd < 0) {
    printf("Failed opening parity file [%s]\n", strerror(errno));
    return 1;
  }
  off_t filesize = 5151751667712ull;
  int res;
  struct stat statbuf;
  if (fstat(fd, &statbuf)) {
    printf("Failed stat [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  printf("Original file size is %llu bytes\n", (unsigned long long)statbuf.st_size);
  printf("Trying to grow file to %llu bytes\n", (unsigned long long)filesize);
  res = fallocate(fd, 0, 0, filesize);
  if (res) {
    printf("Failed fallocate [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  if (fsync(fd)) {
    printf("Failed fsync [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  close(fd);
  return 0;
}

So the call doesn't make use of the previous file size as offset for the extension. int fallocate(int fd, int mode, off_t offset, off_t len); What you are implying here is that if the fallocate() call is modified to: res = fallocate(fd,0,old_size,new_size-old_size); then everything should work as expected? Based on what I've seen testing on my end, yes, that should cause things to work correctly. That said, given what snapraid does, the fact that they call fallocate covering the full desired size of the file is correct usage (the point is to make behavior deterministic, and calling it on the whole file makes sure that the file isn't sparse, which can impact performance). Given both the fact that calling fallocate() to extend the file without worrying about an offset is a legitimate use case, and that both ext4 and XFS (and I suspect almost every other Linux filesystem) works in this situation, I'd argue that the behavior of BTRFS in this situation is incorrect. /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 10:47, Austin S. 
Hemmelgarn wrote: On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. 
root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. I think I _might_ understand what's going on here. Is that test program calling fallocate using the desired total size of the file, or just trying to allocate the range beyond the end to extend the file? I've seen issues with the first case on BTRFS before, and I'm starting to think that it might actually be trying to allocate the exact amount of space requested by fallocate, even if part of the range is already allocated space. OK, I just did a dead simple test by hand, and it looks like I was right. The method I used to check this is as follows: 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though). 2. Using dd or a similar tool,
Re: [PATCH] btrfs: copy fsid to super_block s_uuid
On Tue, Aug 01, 2017 at 06:35:08PM +0800, Anand Jain wrote:
> We didn't copy fsid to struct super_block.s_uuid so Overlay disables
> index feature with btrfs as the lower FS.
>
> kernel: overlayfs: fs on '/lower' does not support file handles, falling back
> to index=off.
>
> Fix this by publishing the fsid through struct super_block.s_uuid.
>
> Signed-off-by: Anand Jain
> ---
> I tried to know if in case did we deliberately missed this for some reason,
> however there is no information on that. If we mount a non-default subvol in
> the next mount/remount, its still the same FS, so publishing the FSID
> instead of subvol uuid is correct, OR I can't think any other reason for
> not using s_uuid for btrfs.
>
>  fs/btrfs/disk-io.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 080e2ebb8aa0..b7e72d040442 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2899,6 +2899,7 @@ int open_ctree(struct super_block *sb,
>
>  	sb->s_blocksize = sectorsize;
>  	sb->s_blocksize_bits = blksize_bits(sectorsize);
> +	memcpy(&sb->s_uuid, fs_info->fsid, BTRFS_FSID_SIZE);

uuid_copy()?

--D

>  	mutex_lock(&fs_info->chunk_mutex);
>  	ret = btrfs_read_sys_array(fs_info);
> --
> 2.13.1
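For context on the uuid_copy() suggestion: in the kernel, uuid_t is a 16-byte struct and uuid_copy() is just a fixed-size copy, so the change is about type safety rather than behavior. A userspace sketch of that equivalence (these definitions mirror, rather than reuse, the kernel's):

```c
#include <string.h>

#define UUID_SIZE 16

/* Userspace stand-in for the kernel's uuid_t: 16 raw bytes. */
typedef struct {
    unsigned char b[UUID_SIZE];
} uuid_t;

/* Stand-in for the kernel's uuid_copy(): a fixed 16-byte copy,
 * semantically the same bytes the memcpy() in the patch moves. */
static void uuid_copy(uuid_t *dst, const uuid_t *src)
{
    memcpy(dst, src, sizeof(uuid_t));
}
```

The migration Anand describes would change the type of btrfs_fs_info.fsid (and its 70-odd users) to uuid_t, after which the open_ctree() line becomes a single uuid_copy() call.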
[PATCH] btrfs: remove broken memory barrier
Commit 38851cc19adb ("Btrfs: implement unlocked dio write") implemented
unlocked dio write, allowing multiple dio writers to write to
non-overlapping, and non-eof-extending regions. In doing so it also
introduced a broken memory barrier. It is broken due to 2 things:

1. Memory barriers _MUST_ always be paired, this is clearly not the case
   here.
2. Checkpatch actually produces a warning if a memory barrier is
   introduced that doesn't have a comment explaining how it's being
   paired.

Signed-off-by: Nikolay Borisov
---
 fs/btrfs/inode.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 95c212037095..5e48d2c10152 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8731,7 +8731,6 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		return 0;
 
 	inode_dio_begin(inode);
-	smp_mb__after_atomic();
 
 	/*
 	 * The generic stuff only does filemap_write_and_wait_range, which
-- 
2.7.4
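As an aside on point 1, the pairing rule the commit message cites can be illustrated in userspace with C11 atomics: a release fence on the producer side pairs with an acquire fence on the consumer side, each carrying the comment checkpatch asks for. This is a minimal analogue, not btrfs code:

```c
#include <stdatomic.h>
#include <pthread.h>

static int payload;          /* plain data published via the flag */
static atomic_int ready;     /* synchronization flag, starts at 0 */

static void *writer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Pairs with the acquire fence in reader(): guarantees payload is
     * visible before the flag is. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *reader(void *arg)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;   /* spin until the writer publishes */
    /* Pairs with the release fence in writer(): the payload read below
     * cannot be reordered before the flag check. */
    atomic_thread_fence(memory_order_acquire);
    *(int *)arg = payload;
    return NULL;
}
```

An unpaired fence, like the smp_mb__after_atomic() being removed, orders nothing by itself; without a matching barrier on the other side there is no "other side" for the guarantee to apply to.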
Re: Massive loss of disk space
Yes, the test code is as below - trying to match what snapraid tries to do:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main() {
  int fd = open("/mnt/snap_04/snapraid.parity", O_NOFOLLOW|O_RDWR);
  if (fd < 0) {
    printf("Failed opening parity file [%s]\n", strerror(errno));
    return 1;
  }
  off_t filesize = 5151751667712ull;
  int res;
  struct stat statbuf;
  if (fstat(fd, &statbuf)) {
    printf("Failed stat [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  printf("Original file size is %llu bytes\n", (unsigned long long)statbuf.st_size);
  printf("Trying to grow file to %llu bytes\n", (unsigned long long)filesize);
  res = fallocate(fd, 0, 0, filesize);
  if (res) {
    printf("Failed fallocate [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  if (fsync(fd)) {
    printf("Failed fsync [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  close(fd);
  return 0;
}

So the call doesn't make use of the previous file size as offset for the extension. int fallocate(int fd, int mode, off_t offset, off_t len); What you are implying here is that if the fallocate() call is modified to: res = fallocate(fd,0,old_size,new_size-old_size); then everything should work as expected? /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 10:47, Austin S. Hemmelgarn wrote: On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. 
pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. I think I _might_ understand what's going on here. 
Is that test program calling fallocate using the desired total size of the file, or just trying to allocate the range beyond the end to extend the file? I've seen issues with the first case on BTRFS before, and I'm starting to think that it might actually be trying to allocate the exact amount of space requested by fallocate, even if part of the range is already allocated space. OK, I just did a dead simple test by hand, and it looks like I was right. The method I used to check this is as follows: 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though). 2. Using dd or a similar tool, create a test file that takes up half of the size of the filesystem. It is important that this _not_ be fallocated, but just written out. 3. Use `fallocate -l` to try and extend the size of the file beyond half the size of the filesystem. For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will succeed with no error. Based on this and some low-level inspection, it looks like BTRFS treats the full range of the fallocate call as unallocated, and thus is trying to allocate space for regions of that range that are already allocated. No issue at all to grow the parity file on the other parity disk. And that's why I wonder if there is some undetected file system corruption. 
Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)
On 8/1/17, Duncan <1i5t5.dun...@cox.net> wrote: > Imran Geriskovan posted on Mon, 31 Jul 2017 22:32:39 +0200 as excerpted: Now the init on /boot is a "19 lines" shell script, including lines for keymap, hdparm, crytpsetup. And let's not forget this is possible by a custom kernel and its reliable buddy syslinux. >>> And I'm using dracut for that, tho quite cut down from its default, >>> with a monolithic kernel and only installing necessary dracut modules. >> Just create minimal bootable /boot for running below init. >> (Your initramfs/rd is a bloated and packaged version of this anyway.) >> Kick the rest. Since you a have your own kernel you are not far away >> from it. > Thanks. You just solved my primary problem of needing to take the time > to actually research all the steps and in what order I needed to do them, > for a hand-rolled script. =:^) It's just a minimal one. But it is a good start. For possible extensions extract your initramfs and explore it. Dracut is bloated. Try mkinitcpio. Once your have your self hosting bootmng, kernel, modules, /boot, init, etc chain, you'll be shocked to realize you have been spending so much time for that bullshit while trying to keep them up.. Get to this point in the shortest possible time. Save your precious time. And reclaim your systems reliability. For X, you'll still need udev or eudev. 
Re: Massive loss of disk space
On 2017-08-01 10:47, Austin S. Hemmelgarn wrote: On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. 
root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. I think I _might_ understand what's going on here. Is that test program calling fallocate using the desired total size of the file, or just trying to allocate the range beyond the end to extend the file? I've seen issues with the first case on BTRFS before, and I'm starting to think that it might actually be trying to allocate the exact amount of space requested by fallocate, even if part of the range is already allocated space. OK, I just did a dead simple test by hand, and it looks like I was right. The method I used to check this is as follows: 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though). 2. Using dd or a similar tool, create a test file that takes up half of the size of the filesystem. It is important that this _not_ be fallocated, but just written out. 3. Use `fallocate -l` to try and extend the size of the file beyond half the size of the filesystem. For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will succeed with no error. Based on this and some low-level inspection, it looks like BTRFS treats the full range of the fallocate call as unallocated, and thus is trying to allocate space for regions of that range that are already allocated. No issue at all to grow the parity file on the other parity disk. 
And that's why I wonder if there is some undetected file system corruption. 
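Austin's three manual steps can be condensed into a single C helper: write part of a file without fallocate, then request the whole range and see whether the filesystem charges again for the already-written portion. Path and sizes here are placeholders; on the affected BTRFS kernels the large-scale version of this returns ENOSPC, while on ext4/XFS it succeeds:

```c
#define _GNU_SOURCE
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write `written` bytes of zeroes (written should be a multiple of
 * 4096), then fallocate() the whole range [0, total). Returns 0 if the
 * whole-file fallocate succeeded, -errno if it failed - e.g. -ENOSPC
 * on the affected BTRFS kernels when `written` exceeds half the free
 * space. */
static int whole_file_fallocate_test(const char *path, off_t written,
                                     off_t total)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -errno;

    char block[4096];
    memset(block, 0, sizeof(block));
    for (off_t off = 0; off < written; off += (off_t)sizeof(block)) {
        if (write(fd, block, sizeof(block)) != (ssize_t)sizeof(block)) {
            int err = -errno;
            close(fd);
            return err;
        }
    }
    fsync(fd);

    /* The pattern from the thread: cover the full desired size,
     * including the range that is already written out. */
    int ret = fallocate(fd, 0, 0, total) ? -errno : 0;
    close(fd);
    return ret;
}
```

Running this with a `written` larger than half the free space on a BTRFS mount of the affected kernels should reproduce the ENOSPC that pwm hit, while the same call on ext4 or XFS (or with small sizes, as below) completes normally.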
Re: Massive loss of disk space
On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. 
I think I _might_ understand what's going on here. Is that test program calling fallocate using the desired total size of the file, or just trying to allocate the range beyond the end to extend the file? I've seen issues with the first case on BTRFS before, and I'm starting to think that it might actually be trying to allocate the exact amount of space requested by fallocate, even if part of the range is already allocated space. No issue at all to grow the parity file on the other parity disk. And that's why I wonder if there is some undetected file system corruption. /Per W On Tue, 1 Aug 2017, Hugo Mills wrote: Hi, Per, Start here: https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29 In your case, I'd suggest using "-dusage=20" to start with, as it'll probably free up quite a lot of your existing allocation. And this may also be of interest, in how to read the output of the tools: https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools Finally, I note that you've still got some "single" chunks present for metadata. It won't affect your space allocation issues, but I would recommend getting rid of them anyway: # btrfs balance start -mconvert=dup,soft Hugo. On Tue, Aug 01, 2017 at 01:43:23PM +0200, pwm wrote: I have a 10TB file system with a parity file for a snapraid. However, I can suddenly not extend the parity file despite the file system only being about 50% filled - I should have 5TB of unallocated space. When trying to extend the parity file, fallocate() just returns ENOSPC, i.e. that the disk is full. Machine was originally a Debian 8 (Jessie) but after I detected the issue and no btrfs tool did show any errors, I have updated to Debian 9 (Snatch) to get a newer kernel and newer btrfs tools. 
pwm@europium:/mnt$ btrfs --version btrfs-progs v4.7.3 pwm@europium:/mnt$ uname -a Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) x86_64 GNU/Linux pwm@europium:/mnt/snap_04$ ls -l total 4932703608 -rw--- 1 root root 319148889 Jul 8 04:21 snapraid.content -rw--- 1 root root 283115520 Aug 1 04:08 snapraid.content.tmp -rw--- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity pwm@europium:/mnt/snap_04$ df . Filesystem 1K-blocks Used Available Use% Mounted on /dev/sdg1 9766434816 4944614648 4819831432 51% /mnt/snap_04 pwm@europium:/mnt/snap_04$ sudo btrfs fi show . Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 Compare this with the second snapraid parity disk: pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/ Label: 'snap_05' uuid: bac477e3-e78c-43ee-8402-6bdfff194567 Total devices 1 FS bytes used 4.69TiB devid1 size 9.09TiB used 4.70TiB path
Re: Massive loss of disk space
Thanks for the links and suggestions. I did try your suggestions, but they didn't solve the underlying problem.

pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
Dumping filters: flags 0x1, state 0x0, force is off
DATA (flags 0x2): balancing, usage=20
Done, had to relocate 4596 out of 9317 chunks

pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
Done, had to relocate 2 out of 4721 chunks

pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
Data, single: total=4.60TiB, used=4.59TiB
System, DUP: total=40.00MiB, used=512.00KiB
Metadata, DUP: total=6.50GiB, used=4.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
Total devices 1 FS bytes used 4.60TiB
devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1

So now device 1 usage is down from 9.09TiB to 4.61TiB. But when I try to fallocate() to grow the large parity file, it fails immediately. I wrote a little helper program that just exercises fallocate(), instead of having to run snapraid with lots of unknown additional actions being performed.

Original file size is 5050486226944 bytes
Trying to grow file to 5151751667712 bytes
Failed fallocate [No space left on device]

And the result afterwards shows 'used' has jumped back up to 9.09TiB.

root@europium:/mnt# btrfs fi show snap_04
Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
Total devices 1 FS bytes used 4.60TiB
devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1

root@europium:/mnt# btrfs fi df /mnt/snap_04/
Data, single: total=9.08TiB, used=4.59TiB
System, DUP: total=40.00MiB, used=992.00KiB
Metadata, DUP: total=6.50GiB, used=4.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

It's almost as if the file system has decided that it needs to make a snapshot and store two complete copies of the file, which is obviously not going to work for a file larger than 50% of the file system.
No issue at all to grow the parity file on the other parity disk. And that's why I wonder if there is some undetected file system corruption.

/Per W

On Tue, 1 Aug 2017, Hugo Mills wrote:
[...]
Re: Btrfs + compression = slow performance and high cpu usage
> Peter, I don't think the filefrag is showing the correct
> fragmentation status of the file when the compression is used.

As reported in a previous message, the output of 'filefrag -v' can be used to see what is going on:

filefrag /mnt/sde3/testfile
/mnt/sde3/testfile: 49287 extents found

Most of the latter extents are mercifully rather contiguous, their size is just limited by the compression code; here is an extract from 'filefrag -v' from around the middle:

24757: 1321888.. 1321919: 11339579.. 11339610: 32: 11339594:
24758: 1321920.. 1321951: 11339597.. 11339628: 32: 11339611:
24759: 1321952.. 1321983: 11339615.. 11339646: 32: 11339629:
24760: 1321984.. 1322015: 11339632.. 11339663: 32: 11339647:
24761: 1322016.. 1322047: 11339649.. 11339680: 32: 11339664:
24762: 1322048.. 1322079: 11339667.. 11339698: 32: 11339681:
24763: 1322080.. 1322111: 11339686.. 11339717: 32: 11339699:
24764: 1322112.. 1322143: 11339703.. 11339734: 32: 11339718:
24765: 1322144.. 1322175: 11339720.. 11339751: 32: 11339735:
24766: 1322176.. 1322207: 11339737.. 11339768: 32: 11339752:
24767: 1322208.. 1322239: 11339754.. 11339785: 32: 11339769:
24768: 1322240.. 1322271: 11339771.. 11339802: 32: 11339786:
24769: 1322272.. 1322303: 11339789.. 11339820: 32: 11339803:

But again this is on a fresh, empty Btrfs volume. As I wrote, "their size is just limited by the compression code", which results in "128KiB writes". On a "fresh empty Btrfs volume" the compressed extents limited to 128KiB also happen to be pretty physically contiguous, but on a more fragmented free space list they can be more scattered. As I already wrote, the main issue here seems to be that we are talking about a "RAID5 with 128KiB writes and a 768KiB stripe size".
On MD RAID5 the slowdown because of RMW seems only to be around 30-40%, but it looks like that several back-to-back 128KiB writes get merged by the Linux IO subsystem (not sure whether that's thoroughly legal), and perhaps they get merged by the 3ware firmware only if it has a persistent cache, and maybe your 3ware does not have one, but you have kept your counsel as to that. My impression is that you read the Btrfs documentation and my replies with a lot less attention than I write them. Some of the things you have done and said make me think that you did not read https://btrfs.wiki.kernel.org/index.php/Compression and 'man 5 btrfs', for example: "How does compression interact with direct IO or COW? Compression does not work with DIO, does work with COW and does not work for NOCOW files. If a file is opened in DIO mode, it will fall back to buffered IO. Are there speed penalties when doing random access to a compressed file? Yes. The compression processes ranges of a file of maximum size 128 KiB and compresses each 4 KiB (or page-sized) block separately." > I am currently defragmenting that mountpoint, ensuring that > everrything is compressed with zlib. Defragmenting the used space might help find more contiguous allocations. > p.s. any other suggestion that might help with the fragmentation > and data allocation. Should I try and rebalance the data on the > drive? Yes, regularly, as that defragments the unused space. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs incremental send | receive fails with Error: File not found
OK. The problem was that the original subvolume had a "Received UUID". This caused all subsequent snapshots to have the same Received UUID, which messes up Btrfs send | receive. Of course this means I must have used btrfs send | receive to create that subvolume and then turned it r/w at some point, though I cannot remember ever doing this. Perhaps a clear notice "WARNING: make sure that the source subvolume does not have a Received UUID" on the Wiki would be helpful? Both on https://btrfs.wiki.kernel.org/index.php/Incremental_Backup and on https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-property

Regards, A

On 7/28/2017 9:32 PM, Hermann Schwärzler wrote:

Hi, for me it looks like those snapshots are not read-only. But as far as I know, for using send they have to be. At least https://btrfs.wiki.kernel.org/index.php/Incremental_Backup#Initial_Bootstrapping states "We will need to create a read-only snapshot ..." I am using send/receive (with read-only snapshots) on a regular basis and never had a problem like yours. What are the commands you use to create your snapshots?

Greetings Hermann

On 07/28/2017 07:26 PM, A L wrote:

I often hit the following error when doing incremental btrfs send-receive: Btrfs incremental send | receive fails with Error: File not found. Sometimes I can do two or three incremental snapshots, but then the same error (with a different file) happens again. It seems that the files were changed or replaced between snapshots, which is causing the problems for send-receive. I have tried deleting all snapshots and starting over, but the problem comes back, so I think it must be a bug. The source volume is: /mnt/storagePool (with RAID1 profile) with subvolume: volume/userData Backup disk is: /media/usb-backup (external USB disk) [...]
Re: Slow mounting raid1
On Tue, Aug 1, 2017 at 2:43 AM, Leonidas Spyropoulos wrote:
> Hi Duncan,
>
> Thanks for your answer

In general I think btrfs takes time proportional to the size of your metadata to mount. Bigger and/or fragmented metadata leads to longer mount times. My big backup fs with >300GB of metadata takes over 20 minutes to mount, and that's with the free space tree, which is significantly faster than space cache v1.

>> If a device takes too long and times out you'll see resets and the like
>> in dmesg, but that normally starts at ~30 seconds, not the 5 seconds you
>> mention. Still, doesn't hurt to check.
Re: Massive loss of disk space
Hi, Per, Start here: https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29 In your case, I'd suggest using "-dusage=20" to start with, as it'll probably free up quite a lot of your existing allocation. And this may also be of interest, in how to read the output of the tools: https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools Finally, I note that you've still got some "single" chunks present for metadata. It won't affect your space allocation issues, but I would recommend getting rid of them anyway: # btrfs balance start -mconvert=dup,soft Hugo. On Tue, Aug 01, 2017 at 01:43:23PM +0200, pwm wrote: > I have a 10TB file system with a parity file for a snapraid. > However, I can suddenly not extend the parity file despite the file > system only being about 50% filled - I should have 5TB of > unallocated space. When trying to extend the parity file, > fallocate() just returns ENOSPC, i.e. that the disk is full. > > Machine was originally a Debian 8 (Jessie) but after I detected the > issue and no btrfs tool did show any errors, I have updated to > Debian 9 (Snatch) to get a newer kernel and newer btrfs tools. > > pwm@europium:/mnt$ btrfs --version > btrfs-progs v4.7.3 > pwm@europium:/mnt$ uname -a > Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 > (2017-06-26) x86_64 GNU/Linux > > > > > pwm@europium:/mnt/snap_04$ ls -l > total 4932703608 > -rw--- 1 root root 319148889 Jul 8 04:21 snapraid.content > -rw--- 1 root root 283115520 Aug 1 04:08 snapraid.content.tmp > -rw--- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity > > > > pwm@europium:/mnt/snap_04$ df . > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/sdg1 9766434816 4944614648 4819831432 51% /mnt/snap_04 > > > > pwm@europium:/mnt/snap_04$ sudo btrfs fi show . 
> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 > Total devices 1 FS bytes used 4.60TiB > devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 > > Compare this with the second snapraid parity disk: > pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/ > Label: 'snap_05' uuid: bac477e3-e78c-43ee-8402-6bdfff194567 > Total devices 1 FS bytes used 4.69TiB > devid1 size 9.09TiB used 4.70TiB path /dev/sdi1 > > So on one parity disk, devid is 9.09TiB used - on the other only 4.70TiB. > While almost the same amount of file system usage. And almost > identical usage pattern. It's an archival RAID, so there is hardly > any writes to the parity files because there are almost no file > changes to the data files. The main usage is that the parity file > gets extended when one of the data disks reaches a new high water > mark. > > The only file that gets regularly rewritten is the snapraid.content > file that gets regenerated after every scrub. > > > > pwm@europium:/mnt/snap_04$ sudo btrfs fi df . > Data, single: total=9.08TiB, used=4.59TiB > System, DUP: total=8.00MiB, used=992.00KiB > System, single: total=4.00MiB, used=0.00B > Metadata, DUP: total=6.00GiB, used=4.81GiB > Metadata, single: total=8.00MiB, used=0.00B > GlobalReserve, single: total=512.00MiB, used=0.00B > > > > pwm@europium:/mnt/snap_04$ sudo btrfs filesystem du . > Total Exclusive Set shared Filename >4.59TiB 4.59TiB - ./snapraid.parity > 304.37MiB 304.37MiB - ./snapraid.content > 270.00MiB 270.00MiB - ./snapraid.content.tmp >4.59TiB 4.59TiB 0.00B . > > > > pwm@europium:/mnt/snap_04$ sudo btrfs filesystem usage . 
> Overall: > Device size: 9.09TiB > Device allocated: 9.09TiB > Device unallocated: 0.00B > Device missing: 0.00B > Used: 4.60TiB > Free (estimated): 4.49TiB (min: 4.49TiB) > Data ratio: 1.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data,single: Size:9.08TiB, Used:4.59TiB >/dev/sdg1 9.08TiB > > Metadata,single: Size:8.00MiB, Used:0.00B >/dev/sdg1 8.00MiB > > Metadata,DUP: Size:6.00GiB, Used:4.81GiB >/dev/sdg1 12.00GiB > > System,single: Size:4.00MiB, Used:0.00B >/dev/sdg1 4.00MiB > > System,DUP: Size:8.00MiB, Used:992.00KiB >/dev/sdg1 16.00MiB > > Unallocated: >/dev/sdg1 0.00B > > > > pwm@europium:~$ sudo btrfs check /dev/sdg1 > Checking filesystem on /dev/sdg1 > UUID: c46df8fa-03db-4b32-8beb-5521d9931a31 > checking extents > checking free space cache > checking fs roots > checking csums > checking root refs > found 5057294639104 bytes used err is 0 > total csum bytes: 4529856120 > total tree bytes: 5170151424 > total fs tree bytes: 178700288 > total extent tree bytes: 209616896 > btree space waste bytes: 182357204 > file data blocks
Massive loss of disk space
I have a 10TB file system with a parity file for a snapraid. However, I can suddenly not extend the parity file despite the file system only being about 50% filled - I should have 5TB of unallocated space. When trying to extend the parity file, fallocate() just returns ENOSPC, i.e. that the disk is full. The machine was originally a Debian 8 (Jessie), but after I detected the issue and no btrfs tool showed any errors, I have updated to Debian 9 (Stretch) to get a newer kernel and newer btrfs tools.

pwm@europium:/mnt$ btrfs --version
btrfs-progs v4.7.3
pwm@europium:/mnt$ uname -a
Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) x86_64 GNU/Linux

pwm@europium:/mnt/snap_04$ ls -l
total 4932703608
-rw--- 1 root root 319148889 Jul 8 04:21 snapraid.content
-rw--- 1 root root 283115520 Aug 1 04:08 snapraid.content.tmp
-rw--- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity

pwm@europium:/mnt/snap_04$ df .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdg1 9766434816 4944614648 4819831432 51% /mnt/snap_04

pwm@europium:/mnt/snap_04$ sudo btrfs fi show .
Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
Total devices 1 FS bytes used 4.60TiB
devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1

Compare this with the second snapraid parity disk:

pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/
Label: 'snap_05' uuid: bac477e3-e78c-43ee-8402-6bdfff194567
Total devices 1 FS bytes used 4.69TiB
devid 1 size 9.09TiB used 4.70TiB path /dev/sdi1

So on one parity disk, devid is 9.09TiB used - on the other only 4.70TiB, while having almost the same amount of file system usage and an almost identical usage pattern. It's an archival RAID, so there are hardly any writes to the parity files because there are almost no file changes to the data files. The main usage is that the parity file gets extended when one of the data disks reaches a new high water mark.
The only file that gets regularly rewritten is the snapraid.content file that gets regenerated after every scrub. pwm@europium:/mnt/snap_04$ sudo btrfs fi df . Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=8.00MiB, used=992.00KiB System, single: total=4.00MiB, used=0.00B Metadata, DUP: total=6.00GiB, used=4.81GiB Metadata, single: total=8.00MiB, used=0.00B GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:/mnt/snap_04$ sudo btrfs filesystem du . Total Exclusive Set shared Filename 4.59TiB 4.59TiB - ./snapraid.parity 304.37MiB 304.37MiB - ./snapraid.content 270.00MiB 270.00MiB - ./snapraid.content.tmp 4.59TiB 4.59TiB 0.00B . pwm@europium:/mnt/snap_04$ sudo btrfs filesystem usage . Overall: Device size: 9.09TiB Device allocated: 9.09TiB Device unallocated: 0.00B Device missing: 0.00B Used: 4.60TiB Free (estimated): 4.49TiB (min: 4.49TiB) Data ratio: 1.00 Metadata ratio: 2.00 Global reserve: 512.00MiB (used: 0.00B) Data,single: Size:9.08TiB, Used:4.59TiB /dev/sdg1 9.08TiB Metadata,single: Size:8.00MiB, Used:0.00B /dev/sdg1 8.00MiB Metadata,DUP: Size:6.00GiB, Used:4.81GiB /dev/sdg1 12.00GiB System,single: Size:4.00MiB, Used:0.00B /dev/sdg1 4.00MiB System,DUP: Size:8.00MiB, Used:992.00KiB /dev/sdg1 16.00MiB Unallocated: /dev/sdg1 0.00B pwm@europium:~$ sudo btrfs check /dev/sdg1 Checking filesystem on /dev/sdg1 UUID: c46df8fa-03db-4b32-8beb-5521d9931a31 checking extents checking free space cache checking fs roots checking csums checking root refs found 5057294639104 bytes used err is 0 total csum bytes: 4529856120 total tree bytes: 5170151424 total fs tree bytes: 178700288 total extent tree bytes: 209616896 btree space waste bytes: 182357204 file data blocks allocated: 5073330888704 referenced 5052040339456 pwm@europium:~$ sudo btrfs scrub status /mnt/snap_04/ scrub status for c46df8fa-03db-4b32-8beb-5521d9931a31 scrub started at Mon Jul 31 21:26:50 2017 and finished after 06:53:47 total bytes scrubbed: 4.60TiB with 0 errors So where have my 
5TB of disk space gone? And what should I do to be able to get it back again? I could obviously reformat the partition and rebuild the parity since I still have one good parity, but that doesn't feel like a good route. It isn't impossible that this might happen again.

/Per W
RE: Btrfs + compression = slow performance and high cpu usage
> -----Original Message-----
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Konstantin V. Gavrilenko
> Sent: Tuesday, 1 August 2017 7:58 PM
> To: Peter Grandi
> Cc: Linux fs Btrfs
> Subject: Re: Btrfs + compression = slow performance and high cpu usage
>
> Peter, I don't think the filefrag is showing the correct fragmentation status of
> the file when the compression is used.
> At least the one that is installed by default in Ubuntu 16.04 - e2fsprogs |
> 1.42.13-1ubuntu1
>
> So for example, fragmentation of a compressed file is 320 times more than an
> uncompressed one.
>
> root@homenas:/mnt/storage/NEW# filefrag test5g-zeroes
> test5g-zeroes: 40903 extents found
>
> root@homenas:/mnt/storage/NEW# filefrag test5g-data
> test5g-data: 129 extents found

Compressed extents are about 128KB, uncompressed extents are about 128MB. (Can't remember the exact numbers.) I've had trouble with slow filesystems when using compression. The problem seems to go away when removing compression.

Paul.
[PATCH] btrfs: copy fsid to super_block s_uuid
We didn't copy the fsid to struct super_block.s_uuid, so overlayfs disables the index feature with btrfs as the lower FS:

kernel: overlayfs: fs on '/lower' does not support file handles, falling back to index=off.

Fix this by publishing the fsid through struct super_block.s_uuid.

Signed-off-by: Anand Jain
---
I tried to find out whether we had deliberately skipped this for some reason, but there is no information on that. If we mount a non-default subvol in the next mount/remount, it's still the same FS, so publishing the FSID instead of the subvol UUID is correct; I can't think of any other reason for not using s_uuid for btrfs.

 fs/btrfs/disk-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 080e2ebb8aa0..b7e72d040442 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2899,6 +2899,7 @@ int open_ctree(struct super_block *sb,
 	sb->s_blocksize = sectorsize;
 	sb->s_blocksize_bits = blksize_bits(sectorsize);
+	memcpy(&sb->s_uuid, fs_info->fsid, BTRFS_FSID_SIZE);

 	mutex_lock(&fs_info->chunk_mutex);
 	ret = btrfs_read_sys_array(fs_info);
--
2.13.1
Re: Btrfs + compression = slow performance and high cpu usage
Peter, I don't think the filefrag is showing the correct fragmentation status of the file when the compression is used. At least the one that is installed by default in Ubuntu 16.04 - e2fsprogs | 1.42.13-1ubuntu1

So for example, fragmentation of a compressed file is 320 times more than an uncompressed one.

root@homenas:/mnt/storage/NEW# filefrag test5g-zeroes
test5g-zeroes: 40903 extents found

root@homenas:/mnt/storage/NEW# filefrag test5g-data
test5g-data: 129 extents found

I am currently defragmenting that mountpoint, ensuring that everything is compressed with zlib.

# btrfs fi defragment -rv -czlib /mnt/arh-backup

My guess is that it will take another 24-36 hours to complete, and then I will redo the test to see if that has helped. Will keep the list posted.

p.s. any other suggestions that might help with the fragmentation and data allocation? Should I try and rebalance the data on the drive?

kos

----- Original Message -----
From: "Peter Grandi" To: "Linux fs Btrfs"
Sent: Monday, 31 July, 2017 1:41:07 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage

[ ... ]

> grep 'model name' /proc/cpuinfo | sort -u
> model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz

Good, contemporary CPU with all accelerations.

> The sda device is a hardware RAID5 consisting of 4x8TB drives. [ ... ]
> Strip Size : 256 KB

So the full RMW data stripe length is 768KiB.

> [ ... ] don't see the previously reported behaviour of one of
> the kworker consuming 100% of the cputime, but the write speed
> difference between the compression ON vs OFF is pretty large.

That's weird; of course 'lzo' is a lot cheaper than 'zlib', but in my test the much higher CPU time of the latter was spread across many CPUs, while in your case it wasn't, even if the E5645 has 6 CPUs and can do 12 threads. That seemed to point to some high cost of finding free blocks, that is a very fragmented free list, or something else.
> dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress oflag=direct > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 26.0685 s, 206 MB/s The results with 'oflag=direct' are not relevant, because Btrfs behaves "differently" with that. > mountflags: > (rw,relatime,compress-force=zlib,space_cache=v2,subvolid=5,subvol=/) [ ... ] > dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 77.4845 s, 69.3 MB/s > mountflags: > (rw,relatime,compress-force=lzo,space_cache=v2,subvolid=5,subvol=/) [ ... ] > dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 122.321 s, 43.9 MB/s That's pretty good for a RAID5 with 128KiB writes and a 768KiB stripe size, on a 3ware, and looks like that the hw host adapter does not have a persistent cache (battery backed usually). My guess that watching transfer rates and latencies with 'iostat -dk -zyx 1' did not happen. > mountflags: (rw,relatime,space_cache=v2,subvolid=5,subvol=/) [ ... ] > dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 10.1033 s, 531 MB/s I had mentioned in my previous reply the output of 'filefrag'. That to me seems relevant here, because of RAID5 RMW and maximum extent size with Brfs compression and strip/stripe size. Perhaps redoing the tests with a 128KiB 'bs' *without* compression would be interesting, perhaps even with 'oflag=sync' instead of 'conv=fsync'. 
It is hard for me to see a speed issue here with Btrfs: for comparison I have done a simple test with both a 3+1 MD RAID5 set with a 256KiB chunk size and a single block device, on "contemporary" 1TB/2TB drives capable of sequential transfer rates of 150-190MB/s:

soft# grep -A2 sdb3 /proc/mdstat
md127 : active raid5 sde3[4] sdd3[2] sdc3[1] sdb3[0]
729808128 blocks super 1.0 level 5, 256k chunk, algorithm 2 [4/4] [UUUU]

with compression:

soft# mount -t btrfs -o commit=10,compress-force=zlib /dev/md/test5 /mnt/test5
soft# mount -t btrfs -o commit=10,compress-force=zlib /dev/sdg3 /mnt/sdg3
soft# rm -f /mnt/test5/testfile /mnt/sdg3/testfile

soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=10000 conv=fsync
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 94.3605 s, 111 MB/s
0.01user 12.59system 1:34.36elapsed 13%CPU (0avgtext+0avgdata 2932maxresident)k
13042144inputs+20482144outputs (3major+345minor)pagefaults 0swaps

soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=10000 conv=fsync
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 93.5885 s, 112 MB/s
0.03user 12.35system 1:33.59elapsed 13%CPU (0avgtext+0avgdata 2940maxresident)k
13042144inputs+20482400outputs
Re: Slow mounting raid1
Hi Duncan, Thanks for your answer On 01/08/17, Duncan wrote: > > If you're doing any snapshotting, you almost certainly want noatime, not > the default relatime. Even without snapshotting and regardless of the > filesystem, tho on btrfs it's a bigger factor due to COW, noatime is a > recommended performance optimization. > > The biggest caveat with that is if you're running something that actually > depends on atime. Few if any modern applications depend on atime, with > mutt in some configurations being an older application that still does. > But AFAIK it only does in some configurations... The array has no snapshots and my mutt resides on a diff SSD btrfs so I can safely try this option. > > Is there anything suspect in dmesg during the mount? What does smartctl > say about the health of the devices? (smartctl -AH at least, the selftest > data is unlikely to be useful unless you actually run the selftests.) > dmesg while mount says: [19823.896790] BTRFS info (device sde): use lzo compression [19823.896798] BTRFS info (device sde): disk space caching is enabled [19823.896800] BTRFS info (device sde): has skinny extents Smartctl tests are scheduled to run all disks once every day (for short test) and every week for long tests. 
smartctl output:

# smartctl -AH /dev/sdd
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.11.12-1-ck] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b 100   100   016    Pre-fail Always  -           0
  2 Throughput_Performance  0x0005 143   143   054    Pre-fail Offline -           67
  3 Spin_Up_Time            0x0007 124   124   024    Pre-fail Always  -           185 (Average 185)
  4 Start_Stop_Count        0x0012 100   100   000    Old_age  Always  -           651
  5 Reallocated_Sector_Ct   0x0033 100   100   005    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000b 100   100   067    Pre-fail Always  -           0
  8 Seek_Time_Performance   0x0005 110   110   020    Pre-fail Offline -           36
  9 Power_On_Hours          0x0012 100   100   000    Old_age  Always  -           4594
 10 Spin_Retry_Count        0x0013 100   100   060    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           353
192 Power-Off_Retract_Count 0x0032 094   094   000    Old_age  Always  -           7671
193 Load_Cycle_Count        0x0012 094   094   000    Old_age  Always  -           7671
194 Temperature_Celsius     0x0002 162   162   000    Old_age  Always  -           37 (Min/Max 17/62)
196 Reallocated_Event_Count 0x0032 100   100   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0022 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0008 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x000a 200   200   000    Old_age  Always  -           0

# smartctl -AH /dev/sde
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.11.12-1-ck] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b 100   100   016    Pre-fail Always  -           0
  2 Throughput_Performance  0x0005 142   142   054    Pre-fail Offline -           69
  3 Spin_Up_Time            0x0007 123   123   024    Pre-fail Always  -           186 (Average 187)
  4 Start_Stop_Count        0x0012 100   100   000    Old_age  Always  -           709
  5 Reallocated_Sector_Ct   0x0033 100   100   005    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000b 100   100   067    Pre-fail Always  -           0
  8 Seek_Time_Performance   0x0005 113   113   020    Pre-fail Offline -           35
  9 Power_On_Hours          0x0012 100   100   000    Old_age  Always  -           4678
 10 Spin_Retry_Count        0x0013 100   100   060    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           353
192 Power-Off_Retract_Count 0x0032 093   093   000    Old_age  Always  -           8407
193 Load_Cycle_Count        0x0012 093   093   000    Old_age
Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)
On Mon, Jul 31, 2017 at 03:00:53PM -0700, Justin Maggard wrote:
> Marc, do you have quotas enabled? IIRC, you're a send/receive user.
> The combination of quotas and btrfs receive can corrupt your
> filesystem, as shown by the xfstest I sent to the list a little while
> ago.

Thanks for checking. I do not use quotas, given the problems I had with them early on, over 2y ago.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: BTRFS error: bad tree block start 0 623771648
On Sun, 30 Jul 2017 18:14:35 +0200 "marcel.cochem" wrote:
> I am pretty sure that not all data is lost, as I can grep through the
> 100 GB SSD partition. But my question is, if there is a tool to rescue
> all (intact) data and maybe have only a few corrupt files which can't
> be recovered.

There is such a tool, see https://btrfs.wiki.kernel.org/index.php/Restore

--
With respect,
Roman
Re: BTRFS error: bad tree block start 0 623771648
On Mon, 31 Jul 2017 11:12:01 -0700 Liu Bo wrote:
> Superblock and chunk tree root is OK, looks like the header part of
> the tree root is now all-zero, but I'm unable to think of a btrfs bug
> which can lead to that (if there is, it is a serious enough one)

I see that the FS is being mounted with "discard". So maybe it was a TRIM gone bad (wrong location or in a wrong sequence). Generally it appears to be not recommended to use "discard" by now (because of its performance impact, and maybe possible issues like this); instead, schedule a call to "fstrim" once a day or so, and/or on boot-up.

> on ssd like disks, by default there is only one copy for metadata.

Time and time again, the default of "single" metadata for SSD is a terrible idea. Most likely DUP metadata would have saved the FS in this case.

--
With respect,
Roman