Re: HELP unmountable partition after btrfs balance to RAID0

2018-12-07 Thread Duncan
Thomas Mohr posted on Thu, 06 Dec 2018 12:31:15 +0100 as excerpted:

> We wanted to convert a file system to a RAID0 with two partitions.
> Unfortunately we had to reboot the server during the balance operation
> before it could complete.
> 
> Now following happens:
> 
> A mount attempt of the array fails with following error code:
> 
> btrfs recover yields roughly 1.6 out of 4 TB.

[Just another btrfs user and list regular, not a dev.  A dev may reply to 
your specific case, but meanwhile, for next time...]

That shouldn't be a problem.  Because a failure of any component takes 
down the entire raid0, making it less reliable than a single device, 
raid0 (in general, not just btrfs) is considered useful only for data of 
low enough value that its loss is no big deal, either because it's truly 
of little value (internet cache being a good example), or because 
backups are kept available and updated for whenever the raid0 array 
fails.  With raid0, it's always a question of when it'll fail, not if.

So loss of a filesystem being converted to raid0 isn't a problem, because 
the data on it, by virtue of being in the process of conversion to raid0, 
is defined as of throw-away value in any case.  If it's of higher value 
than that, it's not going to be raid0 (or in the process of conversion to 
it) in the first place.

Of course that's simply an extension of the more general first sysadmin's 
rule of backups, that the true value of data is defined not by arbitrary 
claims, but by the number of backups of that data it's worth having.  
Because "things happen", whether it's fat-fingering, bad hardware, buggy 
software, or simply someone tripping over the power cable or running into 
the power pole outside at the wrong time.

So having no backup simply defines the data as worth less than the time/
trouble/resources necessary to make that backup.

Note that you ALWAYS save what was of most value to you, either the time/
trouble/resources to do the backup, if your actions defined that to be of 
more value than the data, or the data, if you had that backup, thereby 
defining the value of the data to be worth backing up.

Similarly, failure of the only backup isn't a problem because by virtue 
of there being only that one backup, the data is defined as not worth 
having more than one, and likewise, having an outdated backup isn't a 
problem, because that's simply the special case of defining the data in 
the delta between the backup time and the present as not (yet) worth the 
time/hassle/resources to make/refresh that backup.

(And FWIW, the second sysadmin's rule of backups is that it's not a 
backup until you've successfully tested it recoverable in the same sort 
of conditions you're likely to need to recover it in.  Because so many 
people have /thought/ they had backups, that turned out not to be, 
because they never tested that they could actually recover the data from 
them.  For instance, if the backup tools you'll need to recover the 
backup are on the backup itself, how do you get to them?  Can you create 
a filesystem for the new copy of the data and recover it from the backup 
with just the tools and documentation available from your emergency boot 
media?  Untested backup == no backup, or at best, backup still in 
process!)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: unable to fixup (regular) error

2018-11-26 Thread Duncan
Alexander Fieroch posted on Mon, 26 Nov 2018 11:23:00 +0100 as excerpted:

> Am 26.11.18 um 09:13 schrieb Qu Wenruo:
>> The corruption itself looks like some disk error, not some btrfs error
>> like transid error.
> 
> You're right! SMART has an increased value for one harddisk on
> reallocated sector count. Sorry, I missed to check this first...
> 
> I'll try to salvage my data...

FWIW as a general note about raid0 for updating your layout...

Because raid0 is less reliable than a single device (failure of any 
device of the raid0 is likely to take it out, and failure of any one of N 
is more likely than failure of any specific single device), admins 
generally consider it useful only for "throw-away" data, that is, data 
that can be lost without issue.  Either the data really /is/ "throw-away" 
(internet cache being a common example), or it is considered a "throw-
away" copy of the "real" data stored elsewhere.  That "real" copy may be 
the real working copy, with the raid0 simply a faster cache of it, or the 
raid0 may be the working copy, but with backups updated frequently enough 
that if the raid0 goes, it won't take anything of value with it (read as: 
the effort to replace any data lost will be reasonably trivial, likely 
only a few minutes or hours, at worst perhaps a day's worth of work, 
depending on how many people's work is involved and how much their time 
is considered to be worth).

So if it's raid0, you shouldn't need to worry about trying to 
recover what's on it, and probably shouldn't even be trying to run a 
btrfs check on it at all as it's likely to be more trouble and take more 
time than the throw-away data on it is worth.  If something goes wrong 
with a raid0, just declare it lost, blow it away and recreate fresh, 
restoring from the "real" copy if necessary.  Because for an admin, 
really with any data but particularly for a raid0, it's more a matter of 
when it'll die than if.

If that's inappropriate for the value of the data and status of the 
backups/real-copies, then you should really be reconsidering whether 
raid0 of any sort is appropriate, because it almost certainly is not.


For btrfs, what you might try instead of raid0 is raid1 metadata at 
least, with raid0 or single mode data if there's not room enough to do 
raid1 data as well.  Raid1 metadata would very likely have saved the 
filesystem in this case, with some loss of files possible depending on 
where the damage is: the second copy of the metadata from the good device 
is used to fill in for, and (attempt to, tho if the bad device is 
actively getting worse it might be a losing battle) repair, any metadata 
damage on the bad device, thus giving you a far better chance of 
saving the filesystem as a whole.
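
If the filesystem already exists, that's just a balance with a convert 
filter, and at mkfs time it's a couple of options.  A minimal sketch, 
untested here, with device names and mountpoint being examples only:

  # convert metadata of an existing filesystem to raid1, leave data as-is
  btrfs balance start -mconvert=raid1 /mnt

  # or set it up that way from the start (-d raid0 works here too)
  mkfs.btrfs -m raid1 -d single /dev/sdb /dev/sdc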

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Filesystem mounts fine but hangs on access

2018-11-04 Thread Duncan
Adam Borowski posted on Sun, 04 Nov 2018 20:55:30 +0100 as excerpted:

> On Sun, Nov 04, 2018 at 06:29:06PM +0000, Duncan wrote:
>> So do consider adding noatime to your mount options if you haven't done
>> so already.  AFAIK, the only /semi-common/ app that actually uses
>> atimes these days is mutt (for read-message tracking), and then not for
>> mbox, so you should be safe to at least test turning it off.
> 
> To the contrary, mutt uses atimes only for mbox.

Figures that I'd get it reversed.
 
>> And YMMV, but if you do use mutt or something else that uses atimes,
>> I'd go so far as to recommend finding an alternative, replacing either
>> btrfs (because as I said, relatime is arguably enough on a traditional
>> non-COW filesystem) or whatever it is that uses atimes, your call,
>> because IMO it really is that big a deal.
> 
> Fortunately, mutt's use could be fixed by teaching it to touch atimes
> manually.  And that's already done, for both forks (vanilla and
> neomutt).

Thanks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Filesystem mounts fine but hangs on access

2018-11-04 Thread Duncan
Sebastian Ochmann posted on Sun, 04 Nov 2018 14:15:55 +0100 as excerpted:

> Hello,
> 
> I have a btrfs filesystem on a single encrypted (LUKS) 10 TB drive which
> stopped working correctly.

> Kernel 4.18.16 (Arch Linux)

I see upgrading to 4.19 seems to have solved your problem, but this is 
more about something I saw in the trace that has me wondering...

> [  368.267315]  touch_atime+0xc0/0xe0

Do you have any atime-related mount options set?

FWIW, noatime is strongly recommended on btrfs.

Now I'm not a dev, just a btrfs user and list regular, and I don't know 
if that function is called and just does nothing when noatime is set, so 
you may well already have it set and this is "much ado about nothing", 
but the chance that it's relevant, if not for you, perhaps for others 
that may read it, begs for this post...

The problem with atime, access time, is that it turns most otherwise read-
only operations into read-and-write operations in order to update the 
access time.  And on copy-on-write (COW) based filesystems such as btrfs, 
that can be a big problem, because updating that tiny bit of metadata 
will trigger a rewrite of the entire metadata block containing it, which 
will trigger an update of the metadata for /that/ block in the parent 
metadata tier... all the way up the metadata tree, ultimately to its 
root, the filesystem root and the superblocks, at the next commit 
(normally every 30 seconds or less).

Not only is that a bunch of otherwise unnecessary work for a bit of 
metadata barely anything actually uses, but forcing most read operations 
to read-write obviously compounds the risk for all of those would-be read-
only operations when a filesystem already has problems.

Additionally, if your use-case includes regular snapshotting with atime 
on, then on mostly-read workloads with few writes (other than atime 
updates), most of the changes captured by each snapshot may actually be 
atime updates, making recurring snapshots far larger than they'd be 
otherwise.

Now a few years ago the kernel did change the default to relatime, which 
basically updates the atime for any particular file only once a day; that 
helps quite a bit, and on traditional filesystems it's arguably a 
reasonably sane default.  But COW makes atime tracking enough more 
expensive that setting noatime is still strongly recommended on btrfs, 
particularly if you're doing regular snapshotting.

So do consider adding noatime to your mount options if you haven't done 
so already.  AFAIK, the only /semi-common/ app that actually uses atimes 
these days is mutt (for read-message tracking), and then not for mbox, so 
you should be safe to at least test turning it off.
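
Testing it is as simple as a remount, and checking what's actually in 
effect is one findmnt away; make it permanent by adding noatime to the 
options field of the filesystem's fstab line.  (The mountpoint below is 
just an example.)

  mount -o remount,noatime /home
  findmnt -no OPTIONS /home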

And YMMV, but if you do use mutt or something else that uses atimes, I'd 
go so far as to recommend finding an alternative, replacing either btrfs 
(because as I said, relatime is arguably enough on a traditional non-COW 
filesystem) or whatever it is that uses atimes, your call, because IMO it 
really is that big a deal.

Meanwhile, particularly after seeing that in the trace, if the 4.19 
update hadn't already fixed it, I'd have suggested trying a read-only 
mount, both as a test, and assuming it worked, at least allowing you to 
access the data without the lockup, which would have then been related to 
the write due to the atime update, not the actual read.

Actually, a read-only mount test is always a good troubleshooting step 
when the trouble is a filesystem that either won't mount normally, or 
will, but then locks up when you try to access something.  It's far less 
risky than a normal writable mount, and at minimum it provides you the 
additional test data of whether it worked or not, plus if it does, a 
chance to access the data and make sure your backups are current, before 
actually trying to do any repairs.
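
In command terms that's simply the following, with device and mountpoint 
being examples only:

  mount -o ro /dev/sdX /mnt
  # or, if it's already mounted but misbehaving on writes
  mount -o remount,ro /mnt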

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: BTRFS did it's job nicely (thanks!)

2018-11-03 Thread Duncan
waxhead posted on Fri, 02 Nov 2018 20:54:40 +0100 as excerpted:

> Note that I tend to interpret the btrfs de st / output as if the error
> was NOT fixed even if (seems clearly that) it was, so I think the output
> is a bit misleading... just saying...

See the btrfs-device manpage, stats subcommand, -z|--reset option, and 
device stats section:

-z|--reset
Print the stats and reset the values to zero afterwards.

DEVICE STATS
The device stats keep persistent record of several error classes related 
to doing IO. The current values are printed at mount time and
updated during filesystem lifetime or from a scrub run.


So stats keeps a count of historic errors and is only reset when you 
specifically reset it, *NOT* when the error is fixed.
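
For example (mountpoint being an example only):

  btrfs device stats /mnt      # print the accumulated counters
  btrfs device stats -z /mnt   # print them, then reset them to zero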

(There's actually a recent patch, I believe in the current dev kernel 
4.20/5.0, that will reset a device's stats automatically for the btrfs 
replace case when it's actually a different device afterward anyway.  
Apparently, it doesn't even do /that/ automatically yet.  Keep that in 
mind if you replace that device.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Understanding BTRFS RAID0 Performance

2018-10-05 Thread Duncan
See the mkfs.btrfs manpage for the details, as there's a tradeoff: 
smaller sizes increase (metadata) fragmentation but decrease lock 
contention, while larger sizes pack more efficiently and are less 
fragmented, but updating them is more expensive.  The change in default 
was because 16 KiB was a win over the old 4 KiB for most use-cases, but 
the 32 or 64 KiB options may or may not be, depending on use-case, and of 
course if you're bottlenecking on locks, 4 KiB may still be a win.


Among all those, I'd be especially interested in what thread_pool=n does 
or doesn't do for you, both because it specifically mentions 
parallelization and because I've seen little discussion of it.

space_cache=v2 may also be a big boost for you, if your filesystems are 
the size the 6-device raid0 implies and are at all reasonably populated.

(Metadata) nodesize may or may not make a difference, tho I suspect if so 
it'll be mostly on writes (but I'm not familiar with the specifics there 
so could be wrong).  I'd be interested to see if it does.

In general I can recommend the no_holes and skinny_metadata features but 
you may well already have them, and the noatime mount option, which you 
may well already be using as well.  Similarly, I ensure that all my btrfs 
are mounted from first mount with autodefrag, so it's always on as the 
filesystem is populated, but I doubt you'll see a difference from that in 
your benchmarks unless you're specifically testing an aged filesystem 
that would be heavily fragmented on its own.
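
Put together, a test setup along those lines might look something like 
the below.  This is only a sketch, untested here; device names and the 
thread_pool value are examples, and thread_pool in particular is 
something you'd want to benchmark at several values:

  mkfs.btrfs -n 16k -O no-holes,skinny-metadata -d raid0 -m raid1 /dev/sd[b-g]
  mount -o noatime,autodefrag,space_cache=v2,thread_pool=8 /dev/sdb /mnt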


There's one guy here who has done heavy testing on the ssd stuff and 
knows btrfs on-device chunk allocation strategies very well, having come 
up with a utilization visualization utility and been the force behind the 
relatively recent (4.16-ish) changes to the ssd mount option's allocation 
strategy.  He'd be the one to talk to if you're considering diving into 
btrfs' on-disk allocation code, etc.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Understanding BTRFS RAID0 Performance

2018-10-05 Thread Duncan
Wilson, Ellis posted on Thu, 04 Oct 2018 21:33:29 +0000 as excerpted:

> Hi all,
> 
> I'm attempting to understand a roughly 30% degradation in BTRFS RAID0
> for large read I/Os across six disks compared with ext4 atop mdadm
> RAID0.
> 
> Specifically, I achieve performance parity with BTRFS in terms of
> single-threaded write and read, and multi-threaded write, but poor
> performance for multi-threaded read.  The relative discrepancy appears
> to grow as one adds disks.

[...]

> Before I dive into the BTRFS source or try tracing in a different way, I
> wanted to see if this was a well-known artifact of BTRFS RAID0 and, even
> better, if there's any tunables available for RAID0 in BTRFS I could
> play with.  The man page for mkfs.btrfs and btrfstune in the tuning
> regard seemed...sparse.

This is indeed well known for btrfs at this point, as it hasn't been 
multi-read-thread optimized yet.  I'm personally more familiar with the 
raid1 case, where which one of the two copies gets the read is simply 
even/odd-PID-based, but AFAIK raid0 isn't particularly optimized either.

The recommended workaround is (as you might expect) btrfs on top of 
mdraid.  In fact, while it doesn't apply to your case, btrfs raid1 on top 
of mdraid0s is often recommended as an alternative to btrfs raid10, as 
that gives you the best of both worlds -- the data and metadata integrity 
protection of btrfs checksums and fallback (with writeback of the correct 
version) to the other copy if the first copy read fails checksum 
verification, with the much better optimized mdraid0 performance.  So it 
stands to reason that the same recommendation would apply to raid0 -- 
just do single-mode btrfs on mdraid0, for better performance than the as 
yet unoptimized btrfs raid0.
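
A rough sketch of that layered setup, untested here, with six example 
devices:

  mdadm --create /dev/md0 --level=0 --raid-devices=6 /dev/sd[b-g]
  mkfs.btrfs -d single -m dup /dev/md0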

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Transaction aborted error -28 clone_finish_inode_update

2018-10-05 Thread Duncan
David Goodwin posted on Thu, 04 Oct 2018 17:44:46 +0100 as excerpted:

> While trying to run/use bedup ( https://github.com/g2p/bedup )  I
> hit this :
> 
> 
> [Thu Oct  4 15:34:51 2018] [ cut here ]
> [Thu Oct  4 15:34:51 2018] BTRFS: Transaction aborted (error -28)
> [Thu Oct  4 15:34:51 2018] WARNING: CPU: 0 PID: 28832 at
> fs/btrfs/ioctl.c:3671 clone_finish_inode_update+0xf3/0x140 

> [Thu Oct  4 15:34:51 2018] CPU: 0 PID: 28832 Comm: bedup Not tainted
> 4.18.10-psi-dg1 #1

[snipping a bunch of stuff that I as a non-dev list regular can't do much 
with anyway]

> [Thu Oct  4 15:34:51 2018] BTRFS: error (device xvdg) in
> clone_finish_inode_update:3671: errno=-28 No space left
> [Thu Oct  4 15:34:51 2018] BTRFS info (device xvdg): forced readonly 

> % btrfs fi us /filesystem/
> Overall:
>     Device size:           7.12TiB
>     Device allocated:      6.80TiB
>     Device unallocated:  330.93GiB
>     Device missing:          0.00B
>     Used:                  6.51TiB
>     Free (estimated):    629.87GiB    (min: 629.87GiB)
>     Data ratio:               1.00
>     Metadata ratio:           1.00
>     Global reserve:      512.00MiB    (used: 0.00B)
> 
> Data+Metadata,single: Size:6.80TiB, Used:6.51TiB
>     /dev/xvdf       1.69TiB
>     /dev/xvdg       3.12TiB
>     /dev/xvdi       1.99TiB
> 
> System,single: Size:32.00MiB, Used:780.00KiB
>     /dev/xvdf      32.00MiB
> 
> Unallocated:
>     /dev/xvdf     320.97GiB
>     /dev/xvdg     949.00MiB
>     /dev/xvdi       9.03GiB
> 
> 
> I kind of think there is sufficient free space, at least globally
> within the filesystem.
> 
> Does it require balancing to redistribute the unallocated space better?
> Or is something misbehaving?

The latter, but unfortunately there's not much you can do about it at 
this point but wait for fixes, unless you want to split up that huge 
filesystem into several smaller ones.

In general, btrfs has at least four kinds of "space" that it can run out 
of, tho in your case it appears you're running mixed-mode so data and 
metadata space are combined into one.

* Unallocated space:  This is space that remains entirely unallocated in 
the filesystem.  It matters most when the balance between data and 
metadata space gets off.

This isn't a problem for you as in single mode space can be allocated 
from any device and you have one with hundreds of gigs unallocated.  It 
also tends to be less of a problem on mixed-bg mode, which you're 
running, as there's no distinction in mixed-mode between data and 
metadata.

* Data chunk space:
* Metadata chunk space:

Because you're running mixed-bg mode, there's no distinction between 
these two, but for normal mode, running out of one or the other while all 
the free space is allocated to chunks of the other type, can be a problem.

* Global reserve:  Taken from metadata, the global reserve is space the 
system won't normally use, that it tries to keep clear in order to be 
able to finish transactions once they're started, as btrfs' copy-on-write 
semantics means even deleting stuff requires a bit of additional space 
temporarily.

This seems to actually be where the problem is, because currently, 
certain btrfs operations such as reflinking/cloning/snapshotting (that 
is, just what you were doing) don't really calculate the needed space 
correctly and use arbitrary figures, which can be *wildly* off, while 
conversely a bare half-gig of global-reserve for a huge 7+ TiB filesystem 
seems rather proportionally small.  (Consider that my small pair-device 
btrfs raid1 root filesystem, 8 GiB per device, 16 GiB total, has a 16 MiB 
reserve; proportionally, your 7+ TiB filesystem would have a 7+ GiB 
reserve, but it only has half a GiB.)

So relatively small btrfs filesystems don't tend to run into the problem, 
because 
they have proportionally larger reserves to begin with.  Plus they 
probably don't have proportionally as many snapshots/reflinks/etc, 
either, so the problem simply doesn't trigger for them.

Now I'm not a dev and my own use-case doesn't include either snapshotting 
or deduping, so I haven't paid that much attention to the specifics, but 
I have seen some recent patches on-list that based on the explanations 
should go some way toward fixing this problem by using more realistic 
figures for global-reserve calculations.  At this point those patches 
would be for 4.20 (which might be 5.0), or possibly 4.21, but the devs 
are indeed working on the problem and it should get better within a 
couple kernel cycles.

Alternatively perhaps the global reserve size could be bumped up on such 
large filesystems, but let's see if the more realistic operations-reserve 
calculations can fix things, first, as arguably that shouldn't be 
necessary once the calculations aren't so arbitrarily wild.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: What to do with damaged root fllesystem (opensuse leap 42.2)

2018-10-05 Thread Duncan
For the btrfs restore, you should find a current btrfs-progs, 
4.17.1 ATM, to do it with, as that should give you the best results.  Try 
Fedora Rawhide or Arch (or the Gentoo I run), as they tend to have more 
current versions.

Then you need some place to put the scraped files, a writable filesystem 
with enough space to put what you're trying to restore.

Once you have some place to put the scraped files, with luck, it's a 
simple case of running...

btrfs restore <options> <device> <path>

... where ...

<device> is the damaged filesystem

<path> is the path on the writable filesystem where you want to dump the 
restored files

and <options> can include various options as found in the btrfs-restore 
manpage, like -m/--metadata if you want to try to restore owner/times/
perms for the files, -s/--symlinks if you want to try to restore 
symlinks, -x/--xattr if you want to try to restore extended attributes, 
etc.

You may want to do a dry-run with -D/--dry-run first, to get some idea of 
whether it's looking like it can restore many of the files or not, and 
thus, of the sort of free space you may need on the writable filesystem 
to store the files it can restore.
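
So a first attempt might look something like this, with the device and 
target path being examples only:

  btrfs restore -D /dev/sdX2 /mnt/rescue        # dry-run, see what it finds
  btrfs restore -m -s -x /dev/sdX2 /mnt/rescue  # real run, with metadata,
                                                # symlinks and xattrs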


If a simple btrfs restore doesn't seem to get anything, there is an 
advanced mode as well, with a link to the wiki page covering it in the 
btrfs-restore manpage, but it does get quite technical, and results may 
vary.  You will likely need help with that if you decide to try it, but 
as they say, that's a bridge we can cross when/if we get to it, no need 
to deal with it just yet.

Meanwhile, again, don't worry too much about whether you can recover 
anything here or not, because in any case you already have what was most 
important to you, either backups you can restore from if you considered 
the data worth having them, or the time and trouble you would have put 
into those backups, if you considered saving that more important than 
making the backups.  So losing the data on the filesystem, whether from 
filesystem error as seems to be the case here, from admin fat-fingering 
(the infamous rm -rf .* or the like), or from physical device loss if the 
disks/ssds themselves went bad, can never be a big deal, because the 
maximum value of the data in question is always strictly limited to the 
value of the time/trouble/resources you save(d) by not making and keeping 
a backup of it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs problems

2018-09-22 Thread Duncan
Adrian Bastholm posted on Thu, 20 Sep 2018 23:35:57 +0200 as excerpted:

> Thanks a lot for the detailed explanation.
> Aabout "stable hardware/no lying hardware". I'm not running any raid
> hardware, was planning on just software raid. three drives glued
> together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
> this be a safer bet, or would You recommend running the sausage method
> instead, with "-d single" for safety ? I'm guessing that if one of the
> drives dies the data is completely lost Another variant I was
> considering is running a raid1 mirror on two of the drives and maybe a
> subvolume on the third, for less important stuff

Agreed with CMurphy's reply, but he didn't mention...

As I wrote elsewhere recently, don't remember if it was in a reply to you 
before you tried zfs and came back, or to someone else, so I'll repeat 
here, briefer this time...

Keep in mind that on btrfs, it's possible (and indeed the default with 
multiple devices) to run data and metadata at different raid levels.

IMO, as long as you're following an appropriate backup policy that backs 
up anything valuable enough to be worth the time/trouble/resources of 
doing so, so if you /do/ lose the array you still have a backup of 
anything you considered valuable enough to worry about (and that caveat 
is always the case, no matter where or how it's stored, value of data is 
in practice defined not by arbitrary claims but by the number of backups 
it's considered worth having of it)...

With that backups caveat, I'm now confident /enough/ about raid56 mode to 
be comfortable cautiously recommending it for data, tho I'd still /not/ 
recommend it for metadata, which I'd recommend should remain the multi-
device default raid1 level.

That way, you're only risking a limited amount of raid5 data to the not 
yet as mature and well tested raid56 mode, while the metadata remains 
protected by the more mature raid1 mode.  If something does go wrong, 
it's much more likely to be only a few files lost instead of the entire 
filesystem, which is what's at risk if your metadata is raid56 as well: 
the metadata, including checksums, will be intact, so scrub should tell 
you which files are bad, and if those few files are valuable they'll be 
on the backup and easy enough to restore, compared to restoring the 
entire filesystem.  And for most use-cases, metadata should be relatively 
small compared to data, so duplicating metadata as raid1 while doing 
raid5 for data should go much easier on the capacity needs than raid1 for 
both would.
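
In command terms that's simply the following, with device names and 
mountpoint being examples only:

  # at creation time
  mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd

  # or converting an existing multi-device filesystem
  btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt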

Tho I'd still recommend raid1 data as well for higher maturity and tested 
ability to use the good copy to rewrite the bad one if one copy goes bad 
(in theory, raid56 mode can use parity to rewrite as well, but that's not 
yet as well tested and there's still the narrow degraded-mode crash write 
hole to worry about), if it's not cost-prohibitive for the amount of data 
you need to store.  But for people on a really tight budget or who are 
storing double-digit TB of data or more, I can understand why they prefer 
raid5, and I do think raid5 is stable enough for data now, as long as the 
metadata remains raid1, AND they're actually executing on a good backup 
policy.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [RFC PATCH v2 0/4] btrfs-progs: build distinct binaries for specific btrfs subcommands

2018-09-22 Thread Duncan
Axel Burri posted on Fri, 21 Sep 2018 11:46:37 +0200 as excerpted:

> I think you got me wrong here: There will not be binaries with the same
> filename. I totally agree that this would be a bad thing, no matter if
> you have bin/sbin merged or not, you'll end up in either having a
> collision or (even worse) rely on the order in $PATH.
> 
> With this "separated" patchset, you can install a binary
> "btrfs-subvolume-show", which has the same functionality as "btrfs
> subvolume show" (note the whitespace/dash), ending up with:
> 
> /sbin/btrfs
> /usr/bin/btrfs-subvolume-show
> /usr/bin/btrfs-subvolume-list

I did get you wrong (and had even understood the separately named 
binaries from an earlier post, too, but forgot).

Thanks. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [RFC PATCH v2 0/4] btrfs-progs: build distinct binaries for specific btrfs subcommands

2018-09-20 Thread Duncan
Axel Burri posted on Thu, 20 Sep 2018 00:02:22 +0200 as excerpted:

> Now not everybody wants to install these with fscaps or setuid, but it
> might also make sense to provide "/usr/bin/btrfs-subvolume-{show,list}",
> as they now work for a regular user. Having both root/user binaries
> concurrently is not an issue (e.g. in gentoo the full-featured btrfs
> command is in "/sbin/").

That's going to be a problem for distros (or users like me with advanced 
layouts, on gentoo too FWIW) that have the bin/sbin merge, where one is a 
symlink to the other.

FWIW I have both the /usr merge (tho reversed for me, so /usr -> . 
instead of having to have /bin and /sbin symlinks to /usr/bin) and the 
bin/sbin merge, along with, since I'm on amd64-nomultilib, the lib/lib64 
merge.  So:

$$ dir -gGd /bin /sbin /usr /lib /lib64
drwxr-xr-x 1 35688 Sep 18 22:56 /bin
lrwxrwxrwx 1 5 Aug  7 00:29 /lib -> lib64
drwxr-xr-x 1 78560 Sep 18 22:56 /lib64
lrwxrwxrwx 1 3 Mar 11  2018 /sbin -> bin
lrwxrwxrwx 1 1 Mar 11  2018 /usr -> .


Of course that last one (/usr -> .) leads to /share and /include hanging 
directly off of / as well, but it works.

But in that scheme /bin, /sbin, /usr/bin and /usr/sbin are all the same 
dir, so only one executable of a particular name can exist therein.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)

2018-09-20 Thread Duncan
Tomasz Chmielewski posted on Wed, 19 Sep 2018 10:43:18 +0200 as excerpted:

> I have a mysql slave which writes to a RAID-1 btrfs filesystem (with
> 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.
> 
> The slave receives around 0.5-1 MB/s of data from the master over the
> network, which is then saved to MySQL's relay log and executed. In ideal
> conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s
> of data written to disk.
> 
> MySQL directory and files in it are chattr +C (since the directory was
> created, so all files are really +C); there are no snapshots.
> 
> 
> Now, an interesting thing.
> 
> When the filesystem is mounted with these options in fstab:
> 
> defaults,noatime,discard
> 
> 
> We can see a *constant* write of 25-100 MB/s to each disk. The system is
> generally unresponsive and it sometimes takes long seconds for a simple
> command executed in bash to return.
> 
> 
> However, as soon as we remount the filesystem with space_cache=v2 -
> writes drop to just around 3-10 MB/s to each disk. If we remount to
> space_cache - lots of writes, system unresponsive. Again remount to
> space_cache=v2 - low writes, system responsive.
> 
> 
> That's a huuge, 10x overhead! Is it expected? Especially that
> space_cache=v1 is still the default mount option?

The other replies are good but I've not seen this pointed out yet...

Perhaps you are accounting for this already, but you don't /say/ you 
are, while you do mention repeatedly toggling the space-cache options, 
which would trigger it, so you /need/ to account for it...

I'm not sure about space_cache=v2 (it's probably more efficient with it 
even if it does have to do it), but I'm quite sure that space_cache=v1 
takes some time after initial mount with it to scan the filesystem and 
actually create the map of available free space that is the space_cache.

Now you said ssds, which should be reasonably fast, but you also say 3-
device btrfs raid1, with each device ~2TB, and the filesystem ~40% full, 
which should be ~2 TB of data, which is likely somewhat fragmented so 
it's likely rather more than 2 TB of data chunks to scan for free space, 
and that's going to take /some/ time even on SSDs!

So if you're toggling settings like that in your tests, be sure to let 
the filesystem rebuild its cache that you just toggled and give it time 
to complete that and quiesce, before you start trying to measure write 
amplification.

Otherwise it's not write-amplification you're measuring, but the churn 
from the filesystem still trying to reset its cache after you toggled it!


Also, while 4.17 is well after the ssd mount option fixes that went in 
with 4.14 (the option is usually auto-detected; check /proc/mounts, mount 
output, or dmesg to see if it's being added), if the filesystem has been 
in use for several kernel cycles, in particular before 4.14, with the ssd 
mount option active, and you've not rebalanced since then, you may well 
still have serious space fragmentation from that.  That fragmentation 
could increase the amount of data in the space_cache map rather 
drastically, thus increasing the time it takes to update the 
space_cache, particularly v1, after toggling it on.
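
Checking whether the ssd option ended up active is quick enough 
(mountpoint being an example only):

  findmnt -no OPTIONS /var/lib/mysql
  # or
  grep btrfs /proc/mounts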

A balance can help correct that, but it might well be easier and should 
result in a better layout to simply blow the filesystem away with a 
mkfs.btrfs and start over.


Meanwhile, as Remi already mentioned, you might want to reconsider nocow 
on btrfs raid1, since nocow defeats checksumming, so scrub, which 
verifies checksums, simply skips those files, and if the two copies get 
out of sync for some reason...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Duncan
Chris Murphy posted on Tue, 18 Sep 2018 13:34:14 -0600 as excerpted:

> I've run into some issue where grub2-mkconfig and grubby, can change the
> grub.cfg, and then do a really fast reboot without cleanly unmounting
> the volume - and what happens? Can't boot. The bootloader can't do log
> replay so it doesn't see the new grub.cfg at all. If all you do is mount
> the volume and unmount, log replay happens, the fs metadata is all fixed
> up just fine, and now the bootloader can see it.
> This same problem can happen with the kernel and initramfs
> installations.
> 
> (Hilariously the reason why this can happen is because of a process
> exempting itself from being forcibly killed by systemd *against* the
> documented advice of systemd devs that you should only do this for
> processes not on rootfs; but as a consequence of this process doing the
> wrong thing, systemd at reboot time ends up doing an unclean unmount and
> reboot because it won't kill the kill exempt process.)

That's... interesting!

FWIW here I use grub2, but as many admins I'm quite comfortable with 
bash, and the high-level grub2 config mechanisms simply didn't let me do 
what I needed to do.  So I had to learn the lower-level grub bash-like 
scripting language to do what I wanted to do, and I even go so far as to 
install-mask some of the higher level stuff so it doesn't get installed 
at all, and thus can't somehow run and screw up my config.

So I edit my grub scripts (and grubenv) much like I'd edit any other 
system script (and its separate config file where I have them) I might 
need to update, then save my work, and with both a bios-boot partition 
setup for grub-core and an entirely separate /boot that's not routinely 
mounted unless I'm updating it, I normally unmount it when I'm done, 
before I actually reboot.

So I've never had systemd interfere.

(And of course I have backups.  In fact, on my main personal system, with 
both the working root and its primary backup being btrfs pair-device 
raid1 on separate devices, I have four physical ssds installed, with a 
bios-boot partition with grub installed and a separate dedicated (btrfs 
dup mode) /boot on each of all four, so I have a working grub and /boot 
and three backups, each of which I can point the bios at and have tested 
separately as bootable.  So if upgrading grub or anything on /boot goes 
wrong I find that out testing the working copy, and boot one of the 
backups to resolve the problem before eventually upgrading all three 
backups after the working copy upgrade is well tested.)

> So *already* we have file systems that are becoming too complicated for
> the bootloader to reliably read, because they cannot do journal relay,
> let alone have any chance of modifying (nor would I want them to do
> this). So yeah I'm, very rapidly becoming opposed to grubenv on anything
> but super simple volumes like maybe ext4 without a journal (extents are
> nice); or even perhaps GRUB should just implement its own damn file
> system and we give it its own partition - similar to BIOS Boot - but
> probably a little bigger

You realize that solution is already standardized as EFI and its standard 
FAT filesystem, right?

=:^)

>>> but is the bootloader overwrite of grubenv going to recompute parity
>>> and write to multiple devices? Eek!
>>
>> Recompute the parity should not be a big deal. Updating all the
>> (b)trees would be a too complex goal.
> 
> I think it's just asking for trouble. Sometimes the best answer ends up
> being no, no and definitely no.

Agreed.  I actually /like/ the fact that at the grub prompt I can rely on 
everything being read-only, and if that SuSE patch to put grubenv in the 
reserved space and make it writable gets upstreamed, I really hope 
there's a build-time configure option to disable the feature, because IMO 
grub doesn't /need/ to save state at that point, and allowing it to do so 
is effectively needlessly playing a risky Russian Roulette game with my 
storage devices.  Were it actually needed that'd be different, but it's 
not needed, so any risk is too much risk.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs panic problem

2018-09-17 Thread Duncan
tc, that you need, and for comparing against others when posted.  But 
once things go bad on you, you really want the newest btrfs-progs in 
ordered to give you the best chance at either fixing things, or worst-
case, at least retrieving the files off the dead filesystem.  So using 
the older distro btrfs-progs for routine running should be fine, but 
unless your backups are complete and frequent enough that if something 
goes wrong it's easiest to simply blow the bad version away with a fresh 
mkfs and start over, you'll probably want at least a reasonably current 
btrfs-progs on your rescue media at least.  Since the userspace version 
numbers are synced to the kernel cycle, a good rule of thumb is keep your 
btrfs-progs version to at least that of the oldest recommended LTS kernel 
version, as well, so you'd want at least btrfs-progs 4.9 on your rescue 
media, for now, and 4.14, coming up, since when the new kernel goes LTS 
that'll displace 4.9 and 4.14 will then be the second-back LTS.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: state of btrfs snapshot limitations?

2018-09-14 Thread Duncan
al 
model, wouldn't actually be much less efficient in terms of snapshot 
taking, because snapshotting is /designed/ to be fast, while at the same 
time it would significantly simplify the logic of the deletion scripts 
since they could simply delete everything older than X, instead of having 
to do conditional thinning logic.

So your scheme of period slotting and capping, as opposed to simply 
timestamping and thinning, is a new thought to me, but I like the idea 
for its simplicity, and as I said, it shouldn't really "cost" more, 
because taking snapshots is fast and relatively cost-free. =:^)
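
As an illustration only (paths and the retention figure are made up, and 
this is untested here), the delete-everything-older-than-X logic really 
can be that simple:

  # one read-only snapshot slot per day
  btrfs subvolume snapshot -r /data "/data/.snaps/$(date +%Y-%m-%d)"

  # drop everything older than 100 days (names sort chronologically)
  cutoff=$(date -d '100 days ago' +%Y-%m-%d)
  for s in /data/.snaps/*; do
      [ "${s##*/}" \< "$cutoff" ] && btrfs subvolume delete "$s"
  done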

I'd still recommend taking it easy on the yearly snapshots, tho, perhaps 
beyond a year or two, preferring physical media swapping and archiving at 
the yearly level if yearly archiving is found necessary at all.  And 
depending on your particular needs, physical-swap archiving at six months 
or even quarterly might actually be appropriate, especially given that 
on-the-shelf archiving should be more dependable as a last-resort backup 
(with spinning rust at least; I guess ssds retain best with periodic 
power-up).

Or do similar online with for example Amazon Glacier (never used 
personally, tho I actually have the site open for reference as I write 
this and at US $0.004 per gig per month... so say $100 for a TB for 2 
years or a couple hundred gig for a decade, $10/yr with a much better 
chance at actually being able to use it after a fire/flood/etc that'd 
take out anything local, tho actually retrieving it would cost a bit 
too... I'm actually thinking perhaps I should consider it... obviously 
I'd well encrypt first... until now I'd always done onsite backup only, 
figuring if I had a fire or something that'd be the last thing I'd be 
worried about, but now I'm actually considering...)

OK, so I guess the bottom-line answer is "it depends."  But the above 
should give you more data to plug in for your specific use-case.

But if it's pure backup, where you don't expect to expand to more devices 
in-place, you can blow it away without having to consider check --repair, 
AND you can do a couple filesystems so as to keep your daily snapshots 
separate from the more frequent backups and thus avoid snapshot deletion, 
you may actually be able to do the 365 dailies for 2-3 years, then swap 
out filesystems and devices without deleting snapshots, thus avoiding the 
maintenance-scaling issues that are the big limitation, and have it work 
just fine.

OTOH, if your use-case is a bit more conventional, with more 
maintenance to have to worry about scaling, capping to 100 snapshots 
remains a reasonable recommendation, and if you need quotas as well and 
can't afford to disable them even temporarily for a balance, you may find 
under 50 snapshots to be your maintenance pain tolerance threshold.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: List of known BTRFS Raid 5/6 Bugs?

2018-09-11 Thread Duncan
st use the defaults and not even be aware 
of the tradeoffs they're making by doing so, as is already the case on 
mdraid and zfs.

---
[1] As I'm no longer running either mdraid or parity-raid, I've not 
followed this extremely closely, but writing this actually spurred me to 
google the problem and see when and how mdraid fixed it.  So the links 
are from that. =:^)

[2] Journalling/journaling, one or two Ls?  The spellcheck flags both and 
last I tried googling it the answer was inconclusive.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: List of known BTRFS Raid 5/6 Bugs?

2018-09-08 Thread Duncan
eature's stability is coming, and /then/ use it, 
after factoring in its remaining then still new and less mature 
additional risk into your backup risks profile, of course.

Time?  Not a dev but following the list and obviously following the new 3-
way-mirroring, I'd say probably not 4.20 (5.0?) for the new mirroring 
modes, so 4.21/5.1 more reasonably likely (if all goes well, could be 
longer), probably another couple cycles (if all goes well) after that for 
the parity-raid logging code built on top of the new mirroring modes, so 
perhaps a year (~5 kernel cycles) to introduction for it.  Then wait 
however many cycles until you think it has stabilized.  Call that another 
year.  So say about 10 kernel cycles or two years.  It could be a bit 
less than that, say 5-7 cycles, if things go well and you take it before 
I'd really consider it stable enough to recommend, but given the 
historically much longer than predicted development and stabilization 
times for raid56 already, it could just as easily end up double that, 4-5 
years out, too.

But raid56 logging mode for write-hole mitigation is indeed actively 
being worked on right now.  That's what we know at this time.

And even before that, right now, raid56 mode should already be reasonably 
usable, especially if you do data raid5/6 and metadata raid1, as long as 
your backup policy and practice is equally reasonable.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Re-mounting removable btrfs on different device

2018-09-06 Thread Duncan
Remi Gauvin posted on Thu, 06 Sep 2018 20:54:17 -0400 as excerpted:

> I'm trying to use a BTRFS filesystem on a removable drive.
> 
> The first drive drive was added to the system, it was /dev/sdb
> 
> Files were added and device unmounted without error.
> 
> But when I re-attach the drive, it becomes /dev/sdg (kernel is fussy
> about re-using /dev/sdb).
> 
> btrfs fi show: output:
> 
> Label: 'Archive 01'  uuid: 221222e7-70e7-4d67-9aca-42eb134e2041
>   Total devices 1 FS bytes used 515.40GiB
>   devid1 size 931.51GiB used 522.02GiB path /dev/sdg1
> 
> This causes BTRFS to fail mounting the device [errors snipped]

> I've seen some patches on this list to add a btrfs device forget option,
> which I presume would help with a situation like this.  Is there a way
> to do that manually?

Without the mentioned patches, the only way (other than reboot) is to 
remove and reinsert the btrfs kernel module (assuming it's a module, not 
built-in), thus forcing it to forget state.

Of course if other critical mounted filesystems (such as root) are btrfs, 
or if btrfs is a kernel-built-in not a module and thus can't be removed, 
the above doesn't work and a reboot is necessary.  Thus the need for 
those patches you mentioned.
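
The manual sequence is roughly the following, assuming nothing else btrfs 
is mounted, with device name and mountpoint being examples only:

  umount /mnt/archive   # if it's still (partially) mounted
  modprobe -r btrfs     # fails if any btrfs is still mounted, or if built-in
  modprobe btrfs
  btrfs device scan     # re-scan, picking the device up at its new name
  mount /dev/sdg1 /mnt/archive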

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: IO errors when building RAID1.... ?

2018-08-31 Thread Duncan
Chris Murphy posted on Fri, 31 Aug 2018 13:02:16 -0600 as excerpted:

> If you want you can post the output from 'sudo smartctl -x /dev/sda'
> which will contain more information... but this is in some sense
> superfluous. The problem is very clearly a bad drive, the drive
> explicitly report to libata a write error, and included the sector LBA
> affected, and only the drive firmware would know that. It's not likely a
> cable problem or something like. And that the write error is reported at
> all means it's persistent, not transient.

Two points:

1) Does this happen to be an archive/SMR (shingled magnetic recording) 
device?  If so that might be the problem as such devices really aren't 
suited to normal usage (they really are designed for archiving), and 
btrfs' COW patterns can exacerbate the issue.  It's quite possible that 
the original install didn't load up the IO as heavily as the balance-
convert does, so the problem appears with convert but not for install.

2) Assuming it's /not/ an SMR issue, and smartctl doesn't say it's dying, 
I'd suggest running badblocks -w on the device (make sure the device 
doesn't have anything valuable on it!).  Note that this will take a 
while, probably a couple days, perhaps longer, as it writes four 
different patterns to the entire device one at a time, reading everything 
back to verify each pattern was written correctly, so it's actually going 
over the entire device 8 times, alternating write and read.  But it should 
settle the issue of the reliability of the device.
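
For the record, the destructive write test is just the following (device 
name is an example, and again, it wipes everything on it):

  badblocks -wsv /dev/sdX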

Or if you'd rather spend the money than the time and it's not still 
under warranty, just replace it, or at least buy a new one to use while 
you run the tests on that one.  I fully understand that tying up the 
thing running tests on it for days straight may not be viable.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: How to erase a RAID1 (+++)?

2018-08-31 Thread Duncan
Alberto Bursi posted on Fri, 31 Aug 2018 14:54:46 +0000 as excerpted:

> I just keep around a USB drive with a full Linux system on it, to act as
> "recovery". If the btrfs raid fails I boot into that and I can do
> maintenance with a full graphical interface and internet access so I can
> google things.

I do something very similar, except my "recovery boot" is my backup 
(normally with two levels of backup/recovery available for root, three 
for some things).

I've actually gone so far as to have /etc/fstab be a symlink to one of 
several files, depending on what version of root vs. the off-root 
filesystems I'm booting, with a set of modular files that get assembled 
by scripts to build the fstabs as appropriate.  So updating fstab is a 
process of updating the modules, then running the scripts to create the 
actual fstabs, and after I update a root backup the last step is changing 
the symlink to point to the appropriate fstab for that backup, so it's 
correct if I end up booting from it.

Meanwhile, each root, working and two backups, is its own set of two 
device partitions in btrfs raid1 mode.  (One set of backups is on 
separate physical devices, covering the device death scenario, the other 
is on different partitions on the same, newer and larger pair of physical 
devices as the working set, so it won't cover device death but still 
covers fat-fingering, filesystem fubaring, bad upgrades, etc.)

/boot is separate and there's four of those (working and three backups), 
one each on each device of the two physical pairs, with the bios able to 
point to any of the four.  I run grub2, so once the bios loads that, I 
can interactively load kernels from any of the other three /boots and 
choose to boot any of the three roots.

And I build my own kernels, with an initrd attached as an initramfs to 
each, and test that they boot.  So selecting a kernel by definition 
selects its attached initramfs as well, meaning the initr*s are backed up 
and selected with the kernels.

(As I said earlier it'd sure be nice to be able to do away with the 
initr*s again.  I was actually thinking about testing that today, which 
was supposed to be a day off, but got called in to work, so the test will 
have to wait once again...)

What's nice about all that is that just as you said, each recovery/backup 
is a snapshot of the working system at the time I took the backup, so 
it's not a limited recovery boot at all, it has the same access to tools, 
manpages, net, X/plasma, browsers, etc, that my normal system does, 
because it /is/ my normal system from whenever I took the backup.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Duncan
ady for that at this point, and you're going to 
run into all sorts of problems trying to do it on an ongoing basis due to 
the above issues.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: DRDY errors are not consistent with scrub results

2018-08-29 Thread Duncan
Cerem Cem ASLAN posted on Wed, 29 Aug 2018 09:58:21 +0300 as excerpted:

> Thinking again, this is totally acceptable. If the requirement was a
> good health disk, then I think I must check the disk health by myself.
> I may believe that the disk is in a good state, or make a quick test or
> make some very detailed tests to be sure.

For testing you might try badblocks.  It's most useful on a device that 
doesn't have a filesystem on it you're trying to save, so you can use the 
-w write-test option.  See the manpage for details.

The -w option should force the device to remap bad blocks where it can as 
well, and you can take your previous smartctl read and compare it to a 
new one after the test.

Hint if testing multiple spinning-rust devices:  Try running multiple 
tests at once.  While this might have been slower on old EIDE, on SATA 
and similar you should be able to test multiple spinning-rust devices at 
once without them slowing down significantly, because the bottleneck is 
the spinning rust, not the bus, controller or CPU.  I 
used badblocks years ago to test my new disks before setting up mdraid on 
them, and with full disk tests on spinning rust taking (at the time) 
nearly a day a pass and four passes for the -w test, the multiple tests 
at once trick saved me quite a bit of time!
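
In shell terms that's nothing fancier than backgrounding the jobs, for 
example (device names being examples only):

  for d in /dev/sdb /dev/sdc /dev/sdd; do
      badblocks -wsv -o "badblocks-${d##*/}.log" "$d" &
  done
  wait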

It's not a great idea to do the test on new SSDs as it's unnecessary 
wear, writing the entire device four times with different patterns each 
time for a -w, but it might be worthwhile to try it on an ssd you're just 
trying to salvage, forcing it to swap out any bad sectors it encounters 
in the process.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs-convert missing in btrfs-tools v4.15.1

2018-08-23 Thread Duncan
Nicholas D Steeves posted on Thu, 23 Aug 2018 14:15:18 -0400 as excerpted:

>> It's in my interest to ship all tools in distros, but there's also only
>> that much what the upstream community can do. If you're going to
>> reconsider the status of btrfs-convert in Debian, please let me know.
> 
> Yes, I'd be happy to advocate for its reinclusion if the answer to 4/5
> of the following questions is "yes".  Does SUSE now recommend the use of
> btrfs-convert to its enterprise customers?  The following is a
> frustrating criteria, but: Can a random desktop user run btrfs-convert
> against their ext4 rootfs and expect the operation to succeed?  Is
> btrfs-convert now sufficiently trusted that it can be recommended with
> the same degree of confidence as a backup, mkfs.btrfs, then restore to
> new filesystem approach?  Does the user of a btrfs volume created with
> btrfs-convert have an equal or lesser probability of encountering bugs
> compared to a one who used mkfs.btrfs?

Just a user and list regular here, and gentoo not debian, but for what it 
counts...

I'd personally never consider or recommend a filesystem converter over 
the backup, mkfs-to-new-fs, restore-to-new-fs, method, for three reasons.

1) Regardless of how stable a filesystem converter is and what two 
filesystems the conversion is between, "things" /do/ occasionally happen, 
thus making it irresponsible to use or recommend use of such a converter 
without a suitably current and tested backup, "just in case."

(This is of course a special case of the sysadmin's first rule of 
backups, that the true value of data is defined not by any arbitrary 
claims, but by the number of backups of that data it's considered worth 
the time/trouble/resources to make/have.  If the data value is trivial 
enough, sure, don't bother with the backup, but if it's of /that/ low a 
value, so low it's not worth a backup even when doing something as 
theoretically risky as a filesystem conversion, why is it worth the time 
and trouble to bother converting it in the first place, instead of just 
blowing it away and starting clean?)

2) Once a backup is considered "strongly recommended", as point 1 just 
established it should be regardless of the stability of the converter, 
then using the existing filesystem as that backup, starting fresh with a 
mkfs for the new filesystem, and copying things over is, simply put, the 
easiest, simplest and cleanest method to change filesystems.

3) (Pretty much)[1] Regardless of the filesystems in question, a fresh 
mkfs and clean sequential transfer of files from the old-fs/backup to the 
new one is pretty well guaranteed to be better optimized than conversion 
from an existing filesystem of a different type, particularly one that 
has been in normal operation for a while and thus has operational 
fragmentation of both data and free-space.  That's in addition to being 
less bug-prone, even for a "stable" converter.


Restating: So (1) doing a conversion without a backup is irresponsible, 
(2) the easiest backup and conversion method is directly using the old fs 
as the backup, and copying over to the freshly mkfs-ed new filesystem, 
and (3) a freshly mkfs-ed filesystem and sequential copy of files to it 
from backup, whether that be the old filesystem or not, is going to be 
more efficient and less bug-prone than an in-place conversion.
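
(A minimal sketch of (2)/(3), device names and mountpoints being 
examples only:

  mkfs.btrfs /dev/sdb1
  mount /dev/sdb1 /mnt/new
  mount -o ro /dev/sda1 /mnt/old
  rsync -aHAX /mnt/old/ /mnt/new/

...after which the old filesystem stays untouched as the fallback until 
the new one has proven itself, and becomes one of the regular backups 
thereafter.)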

Given the above, why would /anyone/ /sane/ consider using a converter?  
It simply doesn't make sense, even if the converter were as stable as the 
most stable filesystems we have.


So as a distro btrfs package maintainer, do what you wish in terms of the 
converter, but were it me, I might actually consider replacing it with an 
executable that simply printed out some form of the above argument, with 
a pointer to the sources should they still be interested after having 
read that argument.[2] Then, if people really are determined to 
unnecessarily waste their time to get a less efficient filesystem, 
possibly risking their data in the process of getting it, they can always 
build the converter from sources themselves.

---
[1] I debated omitting the qualifier as I know of no exceptions, but I'm 
not a filesystem expert and while I'm a bit skeptical, I suppose it's 
possible that they might exist.

[2] There's actually btrfs precedent for this in the form of the 
executable built as fsck.btrfs, which does nothing (successfully) but 
possibly print a message referring people to btrfs check, if run in 
interactive mode.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: lazytime mount option—no support in Btrfs

2018-08-22 Thread Duncan
Austin S. Hemmelgarn posted on Wed, 22 Aug 2018 07:30:09 -0400 as
excerpted:

>> Meanwhile, since broken rootflags requiring an initr* came up let me
>> take the opportunity to ask once again, does btrfs-raid1 root still
>> require an initr*?  It'd be /so/ nice to be able to supply the
>> appropriate rootflags=device=...,device=... and actually have it work
>> so I didn't need the initr* any longer!

> Last I knew, specifying appropriate `device=` options in rootflags works
> correctly without an initrd.

Just to confirm, that's with multi-device btrfs rootfs?  Because it used 
to work when the btrfs was single-device, but not multi-device.

(For multi-device, or at least raid1, one also had to add degraded, or 
it would refuse to mount despite all the appropriate device= entries in 
rootflags, thus of course risking all the problems that running raid1 
degraded operationally can bring.  I never figured out for sure whether 
btrfs was smart enough to eventually pick up the other devices after the 
later device scan, before bringing other btrfs filesystems online, but 
either way it was a risk I wasn't willing to take.)
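
(For concreteness, the sort of kernel commandline in question would be 
roughly the following, device names being examples only:

  root=/dev/sda2 rootfstype=btrfs rootflags=device=/dev/sda2,device=/dev/sdb2

...with the old behavior additionally requiring degraded appended to 
rootflags, risks and all, as described above.)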

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: lazytime mount option—no support in Btrfs

2018-08-21 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 21 Aug 2018 13:01:00 -0400 as
excerpted:

> Otherwise, the only option for people who want it set is to patch the
> kernel to get noatime as the default (instead of relatime).  I would
> look at pushing such a patch upstream myself actually, if it weren't for
> the fact that I'm fairly certain that it would be immediately NACK'ed by
> at least Linus, and probably a couple of other people too.

What about making default-noatime a kconfig option, presumably set to 
default-relatime by default?  That seems to be the way many legacy-
incompatible changes are handled.  Then for most users it's up to the 
distro, which in fact it is already; only now, if the distro set noatime 
as the default, they'd at least be using an upstream option instead of 
patching it themselves, making it upstream code that could be accounted 
for instead of downstream code that... who knows?

Meanwhile, I'd be interested in seeing your local patch.  I'm local-
patching noatime-default here too, but not being a dev, I'm not entirely 
sure I'm doing it "correctly", tho AFAICT it does seem to work.  FWIW, 
here's what I'm doing (posting inline so it may be white-space damaged, and 
IIRC I just recently manually updated the line numbers so they don't 
reflect the code at the 2014 date any more, but as I'm not sure of the 
"correctness" it's not intended to be applied in any case):

--- fs/namespace.c.orig 2014-04-18 23:54:42.167666098 -0700
+++ fs/namespace.c  2014-04-19 00:19:08.622741946 -0700
@@ -2823,8 +2823,9 @@ long do_mount(const char *dev_name, cons
goto dput_out;
 
/* Default to relatime unless overriden */
-   if (!(flags & MS_NOATIME))
-   mnt_flags |= MNT_RELATIME;
+   /* JED: Make that noatime */
+   if (!(flags & MS_RELATIME))
+   mnt_flags |= MNT_NOATIME;
 
/* Separate the per-mountpoint flags */
if (flags & MS_NOSUID)
@@ -2837,6 +2837,8 @@ long do_mount(const char *dev_name, cons
mnt_flags |= MNT_NOATIME;
if (flags & MS_NODIRATIME)
mnt_flags |= MNT_NODIRATIME;
+   if (flags & MS_RELATIME)
+   mnt_flags |= MNT_RELATIME;
if (flags & MS_STRICTATIME)
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)

Sane, or am I "doing it wrong!"(TM), or perhaps doing it correctly, but 
missing a chunk that should be applied elsewhere?


Meanwhile, since broken rootflags requiring an initr* came up let me take 
the opportunity to ask once again, does btrfs-raid1 root still require an 
initr*?  It'd be /so/ nice to be able to supply the appropriate 
rootflags=device=...,device=... and actually have it work so I didn't 
need the initr* any longer!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Duncan
>>> short: values representing quotas are user-oriented ("the numbers
>>> one bought"), not storage-oriented ("the numbers they actually
>>> occupy").

Btrfs quotas are storage-oriented, and if you're using them, at least 
directly, for user-oriented, you're using the proverbial screwdriver as a 
proverbial hammer.

> What is VFS disk quotas and does Btrfs use that at all? If not, why not?
> It seems to me there really should be a high level basic per directory
> quota implementation at the VFS layer, with a single kernel interface as
> well as a single user space interface, regardless of the file system.
> Additional file system specific quota features can of course have their
> own tools, but all of this re-invention of the wheel for basic directory
> quotas is a mystery to me.

As mentioned above and by others, btrfs quotas don't use vfs quotas (or 
the reverse, really, it'd be vfs quotas using information exposed by 
btrfs quotas... if it worked that way), because there's an API mismatch: 
their intended usage and the information they convey and control are 
different, and (AFAIK) were never intended or claimed to be the same.
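
(For anyone wanting to see the btrfs-side interface that /does/ exist, 
it's the storage-oriented qgroup machinery, roughly as below, mountpoint 
and subvolume path being examples only:

  btrfs quota enable /mnt
  btrfs qgroup show /mnt
  btrfs qgroup limit 10G /mnt/some-subvolume

...and what that limits is what the subvolume's extents actually occupy 
on the filesystem, not "the numbers one bought".)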

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: recover broken partition on external HDD

2018-08-06 Thread Duncan
ot having a backup, can be of only trivial value not worth 
the hassle.

There's no #3.  The data was either defined as worth a backup by virtue 
of having one, and can be restored from there, or it wasn't, but no big 
deal because the time/trouble/resources that would have otherwise gone 
into that backup was defined as more important, and was saved before the 
data was ever lost in the first place.

Thus, while the loss of the data due to fat-fingering the placement of 
that ZFS (a risk all sysadmins come to appreciate as very real, after a 
few events of their own) might be a bit of a bother, it's not worth 
spending huge amounts of time trying to recover, because it was either 
worth having a backup, in which case you simply recover from it, or it 
wasn't, in which case it's not worth spending huge amounts of time trying 
to recover, either.

Of course there's still the pre-disaster weighed risk that something will 
go wrong vs. the post-disaster it DID go wrong, now how do I best get 
back to normal operation question, but in the context of the backups rule 
above resolving that question is more a matter of whether it's most 
efficient to spend a little time trying to recover the existing data with 
no guarantee of full success, or to simply jump directly into the wipe 
and restore from known-good (because tested!) backups, which might take 
more time, but has a (near) 100% chance at recovery to the point of the 
backup.  (The slight chance of failure to recover from tested backups is 
what multiple levels of backups cover for, with the value of the 
data and the weighed risk balanced against the value of the time/hassle/
resources necessary to do that one more level of backup.)

So while it might be worth a bit of time to quick-test recovery of the 
damaged data, it very quickly becomes not worth the further hassle, 
because either the data was already defined as not worth it due to not 
having a backup, or restoring from that backup will be faster and less 
hassle, with a far greater chance of success, than diving further into 
the data recovery morass, with ever more limited chances of success.

Live by that sort of policy from now on, and the results of the next 
failure, whether it be hardware, software, or wetware (another fat-
fingering, again, this is coming from someone, me, who has had enough of 
their own!), won't be anything to write the list about, unless of course 
it's a btrfs bug and quite apart from worrying about your data, you're 
just trying to get it fixed so it won't continue to happen.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: BTRFS and databases

2018-08-01 Thread Duncan
MegaBrutal posted on Wed, 01 Aug 2018 05:45:15 +0200 as excerpted:

> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?
> 
> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW nature
> that is elsewhere a blessing, with databases it's a drawback). But are
> there any advantages of still sticking to BTRFS for a database albeit
> CoW is disabled, or should I just return to the old and reliable ext4
> for those applications?

Good question, on which I might expect some honest disagreement on the 
answer.

Personally, I tend to hate nocow with a passion, and would thus recommend 
putting databases and similar write-pattern files (VM images...) on their 
own dedicated non-btrfs filesystem (ext4, etc) if at all reasonable.

But that comes from a general split partition-favoring viewpoint, where 
doing another partition/lvm-volume and putting a different filesystem on 
it is no big deal, as it's just one more partition/volume to manage of 
(likely) several.

Some distros/companies/installations have policies strongly favoring 
btrfs for its "storage pool" features, trying to keep things simple and 
flexible by using just the one solution and one big btrfs and throwing 
everything onto it, often using btrfs subvolumes where others would use 
separate partitions/volumes with independent filesystems.  For these 
folks, the flexibility of being able to throw it all on one filesystem 
with subvolumes overrides the down sides of having to deal with nocow and 
its conditions, rules and additional risk.
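
(For completeness, the way those setups typically apply nocow isn't a 
mount option at all, since nodatacow applies filesystem-wide, but the 
file attribute, set on the still-empty directory so newly created 
database files inherit it, paths being examples only:

  mkdir /srv/mysql
  chattr +C /srv/mysql
  lsattr -d /srv/mysql

Setting +C on an already-written file isn't reliable, thus the 
empty-directory-first rule.)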

And a big part of that flexibility, along with being a feature in its own 
right, is btrfs built-in multi-device, without having to resort to an 
additional multi-device layer such as lvm or mdraid.


So if you're using btrfs for multi-device or other features that nocow 
doesn't affect, it's plausible that you'd prefer nocow on btrfs to 
/having/ to do partitioning/lvm/mdraid and setup that separate non-btrfs 
just for your database (or vm image) files.

But from your post you're perfectly fine with partitioning and the like 
already, and won't consider it a heavy imposition to deal with a separate 
non-btrfs, ext4 or whatever, and in that case, at least here, I'd 
strongly recommend you do just that, avoiding the nocow that I honestly 
see as a compromise best left to those that really need it because they 
aren't prepared to deal with the hassle of setting up the separate 
filesystem along with all that entails.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: csum failed on raid1 even after clean scrub?

2018-08-01 Thread Duncan
Sterling Windmill posted on Mon, 30 Jul 2018 21:06:54 -0400 as excerpted:

> Both drives are identical, Seagate 8TB external drives

Are those the "shingled" SMR drives, normally sold as archive drives and 
first commonly available in the 8TB size, and often bought for their 
generally better price-per-TB without fully realizing the implications?

There have been bugs regarding those drives in the past, and while I 
believe those bugs were fixed and AFAIK the current status is no known 
SMR-specific bugs, they really are /not/ particularly suited to btrfs 
usage even for archiving, and definitely not to general usage (that is, 
pretty much anything but the straight-up archiving use-case they are sold 
for).
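
(If in doubt, something like smartctl -i /dev/sdX -- device name an 
example only -- will at least give the exact model string, which can 
then be looked up to check whether it's one of the SMR models.)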

Of course USB connections are notorious for being unreliable in terms of 
btrfs usage as well, and I'd really hate to think what a combination of 
SMR on USB might wreak.

If they're not SMR then carry-on! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: File permissions lost during send/receive?

2018-07-24 Thread Duncan
Marc Joliet posted on Tue, 24 Jul 2018 22:42:06 +0200 as excerpted:

> On my system I get:
> 
> % sudo getcap /bin/ping /sbin/unix_chkpwd
> /bin/ping = cap_net_raw+ep
> /sbin/unix_chkpwd = cap_dac_override+ep
> 
>> (getcap on unix_chkpwd returns nothing, but while I use kde/plasma I
>> don't normally use the lockscreen at all, so for all I know that's
>> broken here too.)

OK, after remerging pam, I get the same for unix_chkpwd (tho here I have 
sbin merge so it's /bin/unix_chkpwd with sbin -> bin), so indeed, it must 
have been the same problem for you with it, that I've simply not run into 
since whatever killed the filecaps here, because I don't use the 
lockscreen.

But if I start using the lockscreen again and it fails, I know one not-so-
intuitive thing to check, now. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: File permissions lost during send/receive?

2018-07-24 Thread Duncan
Andrei Borzenkov posted on Tue, 24 Jul 2018 20:53:15 +0300 as excerpted:

> 24.07.2018 15:16, Marc Joliet пишет:
>> Hi list,
>> 
>> (Preemptive note: this was with btrfs-progs 4.15.1, I have since
>> upgraded to 4.17.  My kernel version is 4.14.52-gentoo.)
>> 
>> I recently had to restore the root FS of my desktop from backup (extent
>> tree corruption; not sure how, possibly a loose SATA cable?). 
>> Everything was fine,
>> even if restoring was slower than expected.  However, I encountered two
>> files with permission problems, namely:
>> 
>> - /bin/ping, which caused running ping as a normal user to fail due to
>> missing permissions, and
>> 
>> - /sbin/unix_chkpwd (part of PAM), which prevented me from unlocking
>> the KDE Plasma lock screen; I needed to log into a TTY and run
>> "loginctl unlock- session".
>> 
>> Both were easily fixed by reinstalling the affected packages (iputils
>> and pam), but I wonder why this happened after restoring from backup.
>> 
>> I originally thought it was related to the SUID bit not being set,
>> because of the explanation in the ping(8) man page (section
>> "SECURITY"), but cannot find evidence of that -- that is, after
>> reinstallation, "ls -lh" does not show the sticky bit being set, or any
>> other special permission bits, for that matter:
>> 
>> % ls -lh /bin/ping /sbin/unix_chkpwd
>> -rwx--x--x 1 root root 60K 22. Jul 14:47 /bin/ping*
>> -rwx--x--x 1 root root 31K 23. Jul 00:21 /sbin/unix_chkpwd*
>> 
>> (Note: no ACLs are set, either.)
>> 
>> 
> What "getcap /bin/ping" says? You may need to install package providing
> getcap (libcap-progs here on openSUSE).

sys-libs/libcap on gentoo.  Here's what I get:

$ getcap /bin/ping
/bin/ping = cap_net_raw+ep

(getcap on unix_chkpwd returns nothing, but while I use kde/plasma I 
don't normally use the lockscreen at all, so for all I know that's broken 
here too.)

As hinted, it's almost certainly a problem with filecaps.  While I'll 
freely admit to not fully understanding how file-caps work, and my use-
case doesn't use send/receive, I do recall filecaps are what ping uses 
these days instead of SUID/SGID (on gentoo it'd be iputils' filecaps and 
possibly caps USE flags controlling this for ping), and also that btrfs 
send/receive did have a recent bugfix related to the extended-attributes 
normally used to record filecaps, so the symptoms match the bug and 
that's probably what you were seeing.
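
(FWIW, if re-merging/reinstalling the package is inconvenient, the cap 
can also be restored by hand, tho the exact caps a distro expects may 
differ, so treat this purely as an example:

  setcap cap_net_raw+ep /bin/ping
  getcap /bin/ping

...with getcap as the verification step.)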

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs filesystem corruptions with 4.18. git kernels

2018-07-21 Thread Duncan
Alexander Wetzel posted on Fri, 20 Jul 2018 23:28:42 +0200 as excerpted:

> A btrfs subvolume is used as the rootfs on a "Samsung SSD 850 EVO mSATA
> 1TB" and I'm running Gentoo ~amd64 on a Thinkpad W530. Discard is
> enabled as mount option and there were roughly 5 other subvolumes.

Regardless of what your trigger problem is, running with the discard 
mount option considerably increases your risks in at least two ways:

1) Btrfs normally has a feature that tracks old root blocks, which are 
COWed out at each commit.  Should something be wrong with the current 
one, btrfs can fall back to an older one using the usebackuproot 
(formerly recovery, but that clashed with the (no)recovery standard 
option a used on other OSs so they renamed it usebackuproot) mount 
option.  This won't always work, but when it does it's one of the first-
line recovery/repair options, as it tends to mean losing only 30-90 
seconds (first thru third old roots) worth of writes, while being quite 
likely to get you the working filesystem as it was at that commit.

But once the root goes unused, with discard, it gets marked for discard, 
and depending on the hardware/firmware implementation, it may be 
discarded immediately.  If it is, that means no backup roots available 
for recovery should the current root be bad for whatever reason, which 
pretty well takes out your first and best three chances of a quick fix 
without much risk.

2) In the past there have been bugs that triggered on discard.  AFAIK 
there are no such known bugs at this time, but in addition to the risk in 
point one, there's the risk of new bugs that trigger on discard itself, 
and due to the nature of the discard feature, these sorts of bugs have a 
much higher chance than normal of being data-eating bugs.

3) Depending on the device, the discard mount option may or may not have 
negative performance implications as well.

So while the discard mount option is there, it's definitely not 
recommended, unless you really are willing to deal with that extra risk 
and the loss of the backuproot safety-nets, and of course have 
additionally researched its effects on your hardware to make sure it's 
not actually slowing you down (which granted, on good mSATA, it may not 
be, as those are new enough to have a higher likelihood of actually 
having working queued-trim support).

The discard mount option alternative is a scheduled timer/cron job (like 
the one systemd has, just activate it) that does a periodic (weekly for 
systemd's timer) fstrim.  That lowers the risk to the few commits 
immediately after the fstrim job runs -- as long as you don't crash 
during that time, you'll have backup roots available as the current root 
will have moved on since then, creating backups again as it did so.
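
(On systemd that's simply a matter of enabling the shipped timer, and a 
manual run looks like the below, mountpoint an example only:

  systemctl enable --now fstrim.timer
  fstrim -v /mnt

...the -v just reporting how much was trimmed.)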

Or just leave a bit of extra room on the ssd untouched (ideally initially 
trimmed before partitioning and then left unpartitioned, so the firmware 
knows it's clean and can use it at its convenience), so the ssd can use 
that extra room to do its wear-leveling, and don't do trim/discard at all.

FWIW I actually do both of these here, leaving significant space on the 
device unpartitioned, and enabling that systemd fstrim timer job, as well.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Duncan
Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:

>> As implemented in BTRFS, raid1 doesn't have striping.
> 
> The argument is that because there's only two copies, on multi-device
> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> alternate device pairs, it's effectively striped at the macro level,
> with the 1 GiB device-level chunks effectively being huge individual
> device strips of 1 GiB.
> 
> At 1 GiB strip size it doesn't have the typical performance advantage of
> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
> strips/chunks.

I forgot this bit...

Similarly, multi-device single is regarded by some to be conceptually 
equivalent to raid0 with really huge GiB strips/chunks.

(As you may note, "the argument is" and "regarded by some" are distancing 
phrases.  I've seen the argument made on-list, but while I understand it 
and agree with it to some extent, I'm still a bit uncomfortable with it 
and don't normally make it myself -- this thread being a noted exception, 
tho originally I simply repeated what someone else had already said 
in-thread -- because I too agree it's stretching things a bit.  But it 
does appear to be a useful conceptual equivalency for some, and I do see 
the similarity.

Perhaps it's a case of coder's view (no code doing it that way, it's just 
a coincidental oddity conditional on equal sizes), vs. sysadmin's view 
(code or not, accidental or not, it's a reasonably accurate high-level 
description of how it ends up working most of the time with equivalent 
sized devices).)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Duncan
Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:

> On 07/17/2018 11:12 PM, Duncan wrote:
>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>> excerpted:
>> 
>>> On 07/15/2018 04:37 PM, waxhead wrote:
>> 
>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>> parity are mutually exclusive.
>> 
>> I can't agree.  I don't know whether you meant that in the global
>> sense,
>> or purely in the btrfs context (which I suspect), but either way I
>> can't agree.
>> 
>> In the pure btrfs context, while striping and mirroring/pairing are
>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>> flexible enough to allow both together and the feature may at some
>> point be added, so it makes sense to have a layout notation format
>> flexible enough to allow it as well.
> 
> When I say orthogonal, It means that these can be combined: i.e. you can
> have - striping (RAID0)
> - parity  (?)
> - striping + parity  (e.g. RAID5/6)
> - mirroring  (RAID1)
> - mirroring + striping  (RAID10)
> 
> However you can't have mirroring+parity; this means that a notation
> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
> too verbose.

Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
top of mirroring or mirroring on top of raid5/6, much as raid10 is 
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
on top of raid0.  

While it's not possible today on (pure) btrfs (it's possible today with 
md/dm-raid or hardware-raid handling one layer), it's theoretically 
possible both for btrfs and in general, and it could be added to btrfs in 
the future, so a notation with the flexibility to allow parity and 
mirroring together does make sense, and having just that sort of 
flexibility is exactly why Hugo made the notation proposal he did.

Tho a sensible use-case for mirroring+parity is a different question.  I 
can see a case being made for it if one layer is hardware/firmware raid, 
but I'm not entirely sure what the use-case for pure-btrfs raid16 or 61 
(or 15 or 51) might be, where pure mirroring or pure parity wouldn't 
arguably be at least as good a match to the use-case.  Perhaps one of 
the other experts in such things here might help with that.

>>> Question #2: historically RAID10 is requires 4 disks. However I am
>>> guessing if the stripe could be done on a different number of disks:
>>> What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
>>> that every 64k, the data are stored on a different disk
>> 
>> As someone else pointed out, md/lvm-raid10 already work like this. 
>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>> much works this way except with huge (gig size) chunks.
> 
> As implemented in BTRFS, raid1 doesn't have striping.

The argument is that because there's only two copies, on multi-device 
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to 
alternate device pairs, it's effectively striped at the macro level, with 
the 1 GiB device-level chunks effectively being huge individual device 
strips of 1 GiB.

At 1 GiB strip size it doesn't have the typical performance advantage of 
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB 
strips/chunks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-17 Thread Duncan
Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:

> On 07/15/2018 04:37 PM, waxhead wrote:

> Striping and mirroring/pairing are orthogonal properties; mirror and
> parity are mutually exclusive.

I can't agree.  I don't know whether you meant that in the global sense, 
or purely in the btrfs context (which I suspect), but either way I can't 
agree.

In the pure btrfs context, while striping and mirroring/pairing are 
orthogonal today, Hugo's whole point was that btrfs is theoretically 
flexible enough to allow both together and the feature may at some point 
be added, so it makes sense to have a layout notation format flexible 
enough to allow it as well.

In the global context, just to complete things and mostly for others 
reading (as I feel a bit like a simpleton explaining to the expert here): 
just as raid10 is shorthand for raid1+0, aka raid0 layered on top of 
raid1 (normally preferred to raid01, aka raid0+1, aka raid1 on top of 
raid0, due to rebuild characteristics, tho raid01 is sometimes 
recommended here as btrfs raid1 on top of whatever raid0, due to btrfs' 
data integrity characteristics and less optimized performance), so 
there's also raid51 and raid15, raid61 and raid16, etc, with or without 
the + symbols, involving mirroring and parity conceptually at two 
different levels, altho they can be combined in a single implementation 
just as raid10 and raid01 commonly are.  These additional layered-raid 
levels can be used for higher reliability, with differing rebuild and 
performance characteristics between the two forms depending on which is 
the top layer.

> Question #1: for "parity" profiles, does make sense to limit the maximum
> disks number where the data may be spread ? If the answer is not, we
> could omit the last S. IMHO it should.

As someone else already replied, btrfs doesn't currently have the ability 
to specify spread limit, but the idea if we're going to change the 
notation is to allow for the flexibility in the new notation so the 
feature can be added later without further notation changes.

Why might it make sense to specify spread?  At least two possible reasons:

a) (stealing an already posted example) Consider a multi-device layout 
with two or more device sizes.  Someone may want to limit the spread in 
ordered to keep performance and risk consistent as the smaller devices 
fill up, limiting further usage to a lower number of devices.  If that 
lower number is specified as the spread originally it'll make things more 
consistent between the room on all devices case and the room on only some 
devices case.

b) Limiting spread can change the risk and rebuild performance profiles.  
Stripes of full width mean all stripes have a strip on each device, so 
knock a device out and (assuming parity or mirroring) replace it, and all 
stripes are degraded and must be rebuilt.  With less than maximum spread, 
some stripes won't be striped to the replaced device, and won't be 
degraded or need rebuilt, tho assuming the same overall fill, a larger 
percentage of stripes that /do/ need rebuilt will be on the replaced 
device.  So the risk profile is more "objects" (stripes/chunks/files) 
affected but less of each object, or less of the total affected, but more 
of each affected object.

> Question #2: historically RAID10 is requires 4 disks. However I am
> guessing if the stripe could be done on a different number of disks:
> What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
> that every 64k, the data are stored on a different disk

As someone else pointed out, md/lvm-raid10 already work like this.  What 
btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much 
works this way except with huge (gig size) chunks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-08 Thread Duncan
Andrei Borzenkov posted on Fri, 06 Jul 2018 07:28:48 +0300 as excerpted:

> 03.07.2018 10:15, Duncan пишет:
>> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as
>> excerpted:
>> 
>>> 02.07.2018 21:35, Austin S. Hemmelgarn пишет:
>>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>>> bit dangerous to do it while writes are happening).
>>>
>>> Could you please elaborate? Do you mean btrfs can trim data before new
>>> writes are actually committed to disk?
>> 
>> No.
>> 
>> But normally old roots aren't rewritten for some time simply due to
>> odds (fuller filesystems will of course recycle them sooner), and the
>> btrfs mount option usebackuproot (formerly recovery, until the
>> norecovery mount option that parallels that of other filesystems was
>> added and this option was renamed to avoid confusion) can be used to
>> try an older root if the current root is too damaged to successfully
>> mount.

>> But other than simply by odds not using them again immediately, btrfs
>> has
>> no special protection for those old roots, and trim/discard will
>> recover them to hardware-unused as it does any other unused space, tho
>> whether it simply marks them for later processing or actually processes
>> them immediately is up to the individual implementation -- some do it
>> immediately, killing all chances at using the backup root because it's
>> already zeroed out, some don't.
>> 
>> 
> How is it relevant to "while writes are happening"? Will trimming old
> tress immediately after writes have stopped be any different? Why?

Define "while writes are happening" vs. "immediately after writes have 
stopped".  How soon is "immediately", and does the writes stopped 
condition account for data that has reached the device-hardware write 
buffer (so is no longer being transmitted to the device across the bus) 
but not been actually written to media, or not?

On a reasonably quiescent system, multiple empty write cycles are likely 
to have occurred since the last write barrier, and anything in-process is 
likely to have made it to media even if software is missing a write 
barrier it needs (software bug) or the hardware lies about honoring the 
write barrier (hardware bug, allegedly sometimes deliberate on hardware 
willing to gamble with your data that a crash won't happen in a critical 
moment, a somewhat rare occurrence, in ordered to improve normal 
operation performance metrics).

On an IO-maxed system, data and write-barriers are coming down as fast as 
the system can handle them, and write-barriers become critical -- crash 
after something was supposed to get to media but didn't, either because 
of a missing write barrier or because the hardware/firmware lied about 
the barrier and said the data it was supposed to ensure was on-media was, 
when it wasn't, and the btrfs atomic-cow commit guarantees of consistent 
state at each commit go out the window.

At this point it becomes useful to have a number of previous "guaranteed 
consistent state" roots to fall back on, with the /hope/ being that at 
least /one/ of them is usably consistent.  If all but the last one are 
wiped due to trim...

When the system isn't write-maxed the write will have almost certainly 
made it regardless of whether the barrier is there or not, because 
there's enough idle time to finish the current write before another one 
comes down the pipe, so the last-written root is almost certain to be 
fine regardless of barriers, and the history of past roots doesn't matter 
even if there's a crash.

If "immediately after writes have stopped" is strictly defined as a 
condition when all writes including the btrfs commit updating the current 
root and the superblock pointers to the current root have completed, with 
no new writes coming down the pipe in the mean time that might have 
delayed a critical update if a barrier was missed, then trimming old 
roots in this state should be entirely safe, and the distinction between 
that state and the "while writes are happening" is clear.

But if "immediately after writes have stopped" is less strictly defined, 
then the distinction between that state and "while writes are happening" 
remains blurry at best, and having old roots around to fall back on in 
case a write-barrier was missed (for whatever reason, hardware or 
software) becomes a very good thing.

Of course the fact that trim/discard itself is an instruction written to 
the device in the combined command/data stream complexifies the picture 
substantially.  If those write barriers get missed who knows what state 
the new root is in, and if the old ones got erased...  But again, on a 
mostly idle system, it'll probably al

Re: unsolvable technical issues?

2018-07-03 Thread Duncan
Austin S. Hemmelgarn posted on Mon, 02 Jul 2018 07:49:05 -0400 as
excerpted:

> Notably, most Intel systems I've seen have the SATA controllers in the
> chipset enumerate after the USB controllers, and the whole chipset
> enumerates after add-in cards (so they almost always have this issue),
> while most AMD systems I've seen demonstrate the exact opposite
> behavior,
> they enumerate the SATA controller from the chipset before the USB
> controllers, and then enumerate the chipset before all the add-in cards
> (so they almost never have this issue).

Thanks.  That's a difference I wasn't aware of, and it would (because I 
tend to favor amd) explain why I've never seen a change in enumeration 
order unless I've done something like unplugging my sata cables for 
maintenance and forgetting which ones I had plugged in where -- random 
USB stuff left plugged in doesn't seem to matter, and even choosing 
different boot media from the bios doesn't seem to matter by the time the 
kernel runs (I'm less sure about grub).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-03 Thread Duncan
Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:

> 02.07.2018 21:35, Austin S. Hemmelgarn пишет:
>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>> bit dangerous to do it while writes are happening).
> 
> Could you please elaborate? Do you mean btrfs can trim data before new
> writes are actually committed to disk?

No.

But normally old roots aren't rewritten for some time simply due to odds 
(fuller filesystems will of course recycle them sooner), and the btrfs 
mount option usebackuproot (formerly recovery, until the norecovery mount 
option that parallels that of other filesystems was added and this option 
was renamed to avoid confusion) can be used to try an older root if the 
current root is too damaged to successfully mount.

But other than simply by odds not using them again immediately, btrfs has 
no special protection for those old roots, and trim/discard will recover 
them to hardware-unused as it does any other unused space, tho whether it 
simply marks them for later processing or actually processes them 
immediately is up to the individual implementation -- some do it 
immediately, killing all chances at using the backup root because it's 
already zeroed out, some don't.

In the context of the discard mount option, that can mean there's never 
any old roots available ever, as they've already been cleaned up by the 
hardware due to the discard option telling the hardware to do it.

But even not using that mount option, and simply doing the trims 
periodically, as done weekly by for instance the systemd fstrim timer and 
service units, or done manually if you prefer, obviously potentially 
wipes the old roots at that point.  If the system's effectively idle at 
the time, not much risk as the current commit is likely to represent a 
filesystem in full stasis, but if there's lots of writes going on at that 
moment *AND* the system happens to crash at just the wrong time, before 
additional commits have recreated at least a bit of root history, again, 
you'll potentially be left without any old roots for the usebackuproot 
mount option to try to fall back to, should it actually be necessary.
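
(For reference, that fallback attempt is simply, device name and 
mountpoint being examples:

  mount -o ro,usebackuproot /dev/sdX /mnt

...and IIRC btrfs inspect-internal dump-super -f /dev/sdX will show 
whether any backup roots actually survive, in its backup-roots section.)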

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs send/receive vs rsync

2018-06-30 Thread Duncan
Marc MERLIN posted on Fri, 29 Jun 2018 09:24:20 -0700 as excerpted:

>> If instead of using a single BTRFS filesystem you used LVM volumes
>> (maybe with Thin provisioning and monitoring of the volume group free
>> space) for each of your servers to backup with one BTRFS filesystem per
>> volume you would have less snapshots per filesystem and isolate
>> problems in case of corruption. If you eventually decide to start from
>> scratch again this might help a lot in your case.
> 
> So, I already have problems due to too many block layers:
> - raid 5 + ssd - bcache - dmcrypt - btrfs
> 
> I get occasional deadlocks due to upper layers sending more data to the
> lower layer (bcache) than it can process. I'm a bit warry of adding yet
> another layer (LVM), but you're otherwise correct than keeping smaller
> btrfs filesystems would help with performance and containing possible
> damage.
> 
> Has anyone actually done this? :)

So I definitely use (and advocate!) the split-em-up strategy, and I use 
btrfs, but that's pretty much all the similarity we have.

I'm all ssd, having left spinning rust behind.  My strategy avoids 
unnecessary layers like lvm (tho crypt can arguably be necessary), 
preferring direct on-device (gpt) partitioning for simplicity of 
management and disaster recovery.  And my backup and recovery strategy is 
an equally simple mkfs and full-filesystem-fileset copy to an identically 
sized filesystem, with backups easily bootable/mountable in place of the 
working copy if necessary, and multiple backups so if disaster takes out 
the backup I was writing at the same time as the working copy, I still 
have a backup to fall back to.

So it's different enough I'm not sure how much my experience will help 
you.  But I /can/ say the subdivision is nice, as it means I can keep my 
root filesystem read-only by default for reliability, my most-at-risk log 
filesystem tiny for near-instant scrub/balance/check, and my also at risk 
home small as well, with the big media files being on a different 
filesystem that's mostly read-only, so less at risk and needing less 
frequent backups.  The tiny boot and large updates (distro repo, sources, 
ccache) are also separate, and mounted only for boot maintenance or 
updates.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-30 Thread Duncan
 it, if there's anything an admin knows *well* it's that the 
working copy of data **WILL** be damaged.  It's not a matter of if, but 
of when, and of whether it'll be a fat-finger mistake, or a hardware or 
software failure, or wetware (theft, ransomware, etc), or wetware (flood, 
fire and the water that put it out damage, etc), tho none of that 
actually matters after all, because in the end, the only thing that 
matters was how the value of that data was defined by the number of 
backups made of it, and how quickly and conveniently at least one of 
those backups can be retrieved and restored.


Meanwhile, an admin worth the label will also know the relative risk 
associated with various options they might use, including nocow, and 
knowing that downgrades the stability rating of the storage approximately 
to the same degree that raid0 does, they'll already be aware that in such 
a case the working copy can only be defined as "throw-away" level in case 
of problems in the first place, and will thus not even consider their 
working copy to be a permanent copy at all, just a temporary garbage 
copy, only slightly more reliable than one stored on tmpfs, and will thus 
consider the first backup thereof the true working copy, with an 
additional level of backup beyond what they'd normally have thrown in to 
account for that fact.

So in case of problems people can simply restore nocow files from a near-
line stable working copy, much as they'd do after reboot or a umount/
remount cycle for a file stored in tmpfs.  And if they didn't have even a 
stable working copy let alone a backup... well, much like that file in 
tmpfs, what did they expect?  They *really* defined that data as of no 
more than trivial value, didn't they?


All that said, making the NOCOW warning labels a bit more bold-print 
couldn't hurt, and making scrub in the nocow case at least compare copies 
and report differences would simply make it easier for people to know 
they need to reach for that near-line stable working copy, or mkfs and 
start from scratch if they defined the data value as not worth the 
trouble of (in this case) even a stable working copy, let alone a backup, 
so that'd be a good thing too. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: unsolvable technical issues?

2018-06-29 Thread Duncan
Hugo Mills posted on Mon, 25 Jun 2018 16:54:36 + as excerpted:

> On Mon, Jun 25, 2018 at 06:43:38PM +0200, waxhead wrote:
> [snip]
>> I hope I am not asking for too much (but I know I probably am), but I
>> suggest that having a small snippet of information on the status page
>> showing a little bit about what is either currently the development
>> focus , or what people are known for working at would be very valuable
>> for users and it may of course work both ways, such as exciting people
>> or calming them down. ;)
>> 
>> For example something simple like a "development focus" list...
>> 2018-Q4: (planned) Renaming the grotesque "RAID" terminology
>> 2018-Q3: (planned) Magical feature X
>> 2018-Q2: N-Way mirroring
>> 2018-Q1: Feature work "RAID"5/6
>> 
>> I think it would be good for people living their lives outside as it
>> would perhaps spark some attention from developers and perhaps even
>> media as well.
> 
> I started doing this a couple of years ago, but it turned out to be
> impossible to keep even vaguely accurate or up to date, without going
> round and bugging the developers individually on a per-release basis. I
> don't think it's going to happen.

In addition, anything like quarter, kernel cycle, etc, has been 
repeatedly demonstrated to be entirely broken beyond "current", because 
roadmapped tasks have rather consistently taken longer, sometimes /many/ 
/times/ longer (by a factor of 20+ in the case of raid56), than first 
predicted.

But in theory it might be doable with just a roughly ordered list, no 
dates beyond "current focus", and with suitably big disclaimers about 
other things (generally bugs in otherwise more stable features, but 
occasionally a quick sub-feature that is seen to be easier to introduce 
at the current state than it might be later, etc) possibly getting 
priority and temporarily displacing roadmapped items.

In fact, this last one is the big reason why raid56 has taken so long to 
even somewhat stabilize -- the devs kept finding bugs in already semi-
stable features that took priority... for kernel cycle after kernel 
cycle.  The quotas/qgroups feature, already introduced and intended to be 
at least semi-stable was one such culprit, requiring repeated rewrite and 
kernel cycles worth of bug squashing.  A few critical under the right 
circumstances compression bugs, where compression was supposed to be an 
already reasonably stable feature, were another, tho these took far less 
developer bandwidth than quotas.  Getting a reasonably usable fsck was a 
bunch of little patches.  AFAIK that one wasn't actually an original 
focus and was intended to be back-burnered for some time, but once btrfs 
hit mainline, users started demanding it, so the priority was bumped.  
And of course having it has been good for finding and ultimately fixing 
other bugs as well, so it wasn't a bad thing, but the hard fact is the 
repairing fsck has taken, all told, I'd guess about the same number of 
developer cycles as quotas, and those developer cycles had to come from 
stuff that had been roadmapped for earlier.

As a bit of an optimist I'd be inclined to argue that OK, we've gotten 
btrfs in far better shape, general-stability-wise, now, and going forward 
the focus can be back on the stuff that was roadmapped for earlier that 
this stuff displaced, so one might hope things will move faster again 
now, but really, who knows?  That's arguably what the devs thought when 
they mainlined btrfs, too, and yet it took all this much longer to mature 
and stabilize since then.  Still, it /has/ to happen at /some/ point, 
right?  And I know for a fact that btrfs is far more stable now than it 
was... because things like ungraceful shutdowns that used to at minimum 
trigger (raid1 mode) scrub fixes on remount and scrub, now... don't -- 
btrfs is now stable enough that the atomic COW is doing its job and 
things "just work", where before, they required scrub repair at best, and 
the occasional blow-away and restore from backups.  So I can at least 
/hope/ that the worst of the plague of bugs is behind us, and people can 
work on what they intended to do most (say 80%) of the time now, spending 
say a day's worth of work a week (20%) on bugs, instead of the reverse, 
80% (4 days a week) on bugs and, if they're lucky, a day a week on what 
they were supposed to be focused on, which is what we were seeing for a 
while.

Plus the tools to do the debugging, etc, are far more mature now, another 
reason bugs should hopefully take less time now.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: unsolvable technical issues?

2018-06-29 Thread Duncan
Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as
excerpted:

> On 2018-06-24 16:22, Goffredo Baroncelli wrote:
>> On 06/23/2018 07:11 AM, Duncan wrote:
>>> waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:
>>>
>>>> According to this:
>>>>
>>>> https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 ,
>>>> section 1.2
>>>>
>>>> It claims that BTRFS still have significant technical issues that may
>>>> never be resolved.
>>>
>>> I can speculate a bit.
>>>
>>> 1) When I see btrfs "technical issue that may never be resolved", the
>>> #1 first thing I think of, that AFAIK there are _definitely_ no plans
>>> to resolve, because it's very deeply woven into the btrfs core by now,
>>> is...
>>>
>>> [1)] Filesystem UUID Identification.  Btrfs takes the UU bit of
>>> Universally Unique quite literally, assuming they really *are*
>>> unique, at least on that system[.]  Because
>>> btrfs uses this supposedly unique ID to ID devices that belong to the
>>> filesystem, it can get *very* mixed up, with results possibly
>>> including dataloss, if it sees devices that don't actually belong to a
>>> filesystem with the same UUID as a mounted filesystem.
>> 
>> As partial workaround you can disable udev btrfs rules and then do a
>> "btrfs dev scan" manually only for the device which you need.

> You don't even need `btrfs dev scan` if you just specify the exact set
> of devices in the mount options.  The `device=` mount option tells the
> kernel to check that device during the mount process.

Not that lvm does any better in this regard[1], but has btrfs ever solved 
the bug where only one device= in the kernel commandline's rootflags= 
would take effect, effectively forcing initr* on people (like me) who 
would otherwise not need them and prefer to do without them, if they're 
using a multi-device btrfs as root?

Not to mention the fact that as kernel people will tell you, device 
enumeration isn't guaranteed to be in the same order every boot, so 
device=/dev/* can't be relied upon and shouldn't be used -- but of course 
device=LABEL= and device=UUID= and similar won't work without userspace, 
basically udev (if they work at all, IDK if they actually do).

Tho in practice from what I've seen, device enumeration order tends to be 
dependable /enough/ for at least those without enterprise-level numbers 
of devices to enumerate.  True, it /does/ change from time to time with a 
new kernel, but anybody sane keeps a tested-dependable old kernel around 
to boot to until they know the new one works as expected, and that sort 
of change is seldom enough that users can boot to the old kernel and 
adjust their settings for the new one as necessary when it does happen.  
So as "don't do it that way because it's not reliable" as it might indeed 
be in theory, in practice, just using an ordered /dev/* in kernel 
commandlines does tend to "just work"... provided one is ready for the 
occasion when that device parameter might need a bit of adjustment, of 
course.

> Also, while LVM does have 'issues' with cloned PV's, it fails safe (by
> refusing to work on VG's that have duplicate PV's), while BTRFS fails
> very unsafely (by randomly corrupting data).

And IMO that "failing unsafe" is both serious and common enough that it 
easily justifies adding the point to a list of this sort, thus my putting 
it #1.

>>> 2) Subvolume and (more technically) reflink-aware defrag.
>>>
>>> It was there for a couple kernel versions some time ago, but
>>> "impossibly" slow, so it was disabled until such time as btrfs could
>>> be made to scale rather better in this regard.

> I still contend that the biggest issue WRT reflink-aware defrag was that
> it was not optional.  The only way to get the old defrag behavior was to
> boot a kernel that didn't have reflink-aware defrag support.  IOW,
> _everyone_ had to deal with the performance issues, not just the people
> who wanted to use reflink-aware defrag.

Absolutely.

Which of course suggests making it optional, with a suitable warning as 
to the speed implications with lots of snapshots/reflinks, when it does 
get enabled again (and as David mentions elsewhere, there's apparently 
some work going into the idea once again, which potentially moves it from 
the 3-5 year range, at best, back to a 1/2-2-year range, time will tell).

>>> 3) N-way-mirroring.
>>>
>> [...]
>> This is not an issue, but a not implemented feature
> If you're looking at feature parity with competitors, it's an issue.

Exactly my point.  Thanks. =:^)

>>> 4)

Re: unsolvable technical issues?

2018-06-22 Thread Duncan
 introduced in 3.6.  I know because this is the one I've been most 
looking forward to personally, tho my original reason, aging but still 
usable devices that I wanted extra redundancy for, has long since itself 
been aged out of rotation.

Of course we know the raid56 story and thus the implied delay here, if 
it's even still roadmapped at all now, and as with reflink-aware-defrag, 
there's no hint yet as to when we'll actually see this at all, let alone 
see it in a reasonably stable form, so at least in the practical sense, 
it's arguably "might never be resolved."

4) (Until relatively recently, and still in terms of scaling) Quotas.

Until relatively recently, quotas could arguably be added to the list.  
They were rewritten multiple times, and until recently, appeared to be 
effectively eternally broken.

While that has happily changed recently and (based on the list, I don't 
use 'em personally) quotas actually seem at least somewhat usable these 
days (altho less critical bugs are still being fixed), AFAIK quota 
scalability while doing btrfs maintenance remains a serious enough issue 
that the recommendation is to turn them off before doing balances, and 
the same would almost certainly apply to reflink-aware-defrag (turn 
quotas off before defragging) were it available, as well.  That 
scalability alone could arguably be a "technical issue that may never be 
resolved", and while quotas themselves appear to be reasonably functional 
now, that could arguably justify them still being on the list.
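
For those who do run quotas, the workaround amounts to something like the 
below (mountpoint hypothetical; quota enable/disable and balance are the 
real subcommands):

  btrfs quota disable /mnt/pool     # drop qgroup tracking before the heavy work
  btrfs balance start -dusage=70 /mnt/pool
  btrfs quota enable /mnt/pool      # re-enable; expect a qgroup rescan afterward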


And of course that's avoiding the two you mentioned, tho arguably they 
could go on the "may in practice never be resolved, at least not in the 
non-bluesky lifetime" list as well.


As for stratis, supposedly they're deliberately taking existing 
technology already proven in multi-layer form and simply exposing it in 
unified form.  They claim this dramatically lessens the required new code 
and shortens time-to-stability to something reasonable, in contrast to 
the roughly a decade btrfs has taken already, without yet reaching a full 
feature set and full stability.  IMO they may well have a point, tho 
AFAIK they're still new and immature themselves and (I believe) don't 
have that full stability either, so it's a point that AFAIK has yet to be 
fully demonstrated.

We'll see how they evolve.  I do actually expect them to move faster than 
btrfs, but also expect the interface may not be as smooth and unified as 
they'd like to present as I expect there to remain some hiccups in 
smoothing over the layering issues.  Also, because they've deliberately 
chosen to go with existing technology where possible in order to evolve 
to stability faster, by the same token they're deliberately limiting the 
evolution to incremental over existing technology, and I expect there's 
some stuff btrfs will do better as a result... at least until btrfs (or a 
successor) becomes stable enough for them to integrate (parts of?) it as 
existing demonstrated-stable technology.

The other difference, AFAIK, is that stratis is specifically a 
corporation making it a/the main money product, whereas btrfs was always 
something the btrfs devs used at their employers (oracle, facebook), who 
have other things as their main product.  As such, stratis is much more 
likely to prioritize things like raid status monitors, hot-spares, etc, 
that can be part of the product they sell, where they've been lower 
priority for btrfs.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID56

2018-06-20 Thread Duncan
Gandalf Corvotempesta posted on Wed, 20 Jun 2018 11:15:03 +0200 as
excerpted:

> On Wed 20 Jun 2018 at 10:34 Duncan <1i5t5.dun...@cox.net> wrote:
>> Parity-raid is certainly nice, but mandatory, especially when there's
>> already other parity solutions (both hardware and software) available
>> that btrfs can be run on top of, should a parity-raid solution be
>> /that/ necessary?
> 
> You can't be serious.  hw raid has many more flaws than any sw raid.

I didn't say /good/ solutions, I said /other/ solutions.
FWIW, I'd go for mdraid at the lower level, were I to choose, here.

But for a 4-12-ish device solution, I'd probably go btrfs raid1 on a pair 
of mdraid-0s.  That gets you btrfs raid1 data integrity and recovery from 
its other mirror, while also being faster than the still not optimized 
btrfs raid10.  Beyond about a dozen devices, six per "side" of the btrfs 
raid1, the risk of multi-device breakdown before recovery starts to get 
too high for comfort, but six 8 TB devices in raid0 gives you up to 48 TB 
to work with, and more than that arguably should be broken down into 
smaller blocks to work with in any case, because otherwise you're simply 
dealing with so much data it'll take you unreasonably long to do much of 
anything non-incremental with it, from any sort of fscks or btrfs 
maintenance, to trying to copy or move the data anywhere (including for 
backup/restore purposes), to ... whatever.
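
For illustration, a minimal sketch of that sort of layout, assuming six 
hypothetical 8 TB devices split into two mdraid-0 "sides" with btrfs 
raid1 across the pair:

  mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
  mdadm --create /dev/md1 --level=0 --raid-devices=3 /dev/sde /dev/sdf /dev/sdg
  mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
  mount /dev/md0 /mnt/big   # btrfs finds the other "device" via udev/device scan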

Actually, I'd argue that point is reached well before 48 TB, but the 
point remains, at some point it's just too much data to do much of 
anything with, too much to risk losing all at once, too much to backup 
and restore all at once as it just takes too much time to do it, just too 
much...  And that point's well within ordinary raid sizes with a dozen 
devices or less, mirrored, these days.

Which is one of the reasons I'm so skeptical about parity-raid being 
mandatory "nowadays".  Maybe it was in the past, when disks were (say) 
half a TB or less and mirroring a few TB of data was resource-
prohibitive, but now?

Of course we've got a guy here who works with CERN and deals with their 
annual 50ish petabytes of data (49 in 2016, see wikipedia's CERN 
article), but that's simply problems on a different scale.

Even so, I'd say it needs to be broken up into manageable chunks, and 50 PB is 
"only" a bit over 1000 48 TB filesystems worth.  OK, say 2000, so you're 
not filling them all absolutely full.

Meanwhile, I'm actually an N-way-mirroring proponent, here, as opposed to 
a parity-raid proponent.  And at that sort of scale, you /really/ don't 
want to have to restore from backups, so 3-way or even 4-5 way mirroring 
makes a lot of sense.  Hmm... 2.5 dozen for 5-way-mirroring, 2000 times, 
2.5*12*2000=... 60K devices!  That's a lot of hard drives!  And a lot of 
power to spin them.  But I guess it's a rounding error compared to what 
CERN uses for the LHC.

FWIW, N-way-mirroring has been on the btrfs roadmap, since at least 
kernel 3.6, for "after raid56".  I've been waiting awhile too; no sign of 
it yet so I guess I'll be waiting awhile longer.  So as they say, 
"welcome to the club!"  I'm 51 now.  Maybe I'll see it before I die.  
Imagine, I'm in my 80s in the retirement home and get the news btrfs 
finally has N-way-mirroring in mainline.  I'll be jumping up and down and 
cause a ruckus when I break my hip!  Well, hoping it won't be /that/ 
long, but... =;^]

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs balance did not progress after 12H

2018-06-20 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 19 Jun 2018 12:58:44 -0400 as
excerpted:

> That said, I would question the value of repacking chunks that are
> already more than half full.  Anything above a 50% usage filter
> generally takes a long time, and has limited value in most cases (higher
> values are less likely to reduce the total number of allocated chunks).
> With `-dusage=50` or less, you're guaranteed to reduce the number of
> chunks if at least two match, and it isn't very time consuming for the
> allocator, all because you can pack at least two matching chunks into
> one 'new' chunk (new in quotes because it may re-pack them into existing
> slack space on the FS). Additionally, `-dusage=50` is usually sufficient
> to mitigate the typical ENOSPC issues that regular balancing is supposed
> to help with.

While I used to agree, 50% for best efficiency, perhaps 66 or 70% if 
you're really pressed for space, now that the allocator can repack into 
existing chunks more efficiently than it used to (at least in ssd mode, 
which all my storage is now), I've seen higher values result in practical/
noticeable recovery of space to unallocated as well.

In fact, I routinely use usage=70 these days, and sometimes use higher, 
to 99 or even 100%[1].  But of course I'm on ssd so it's far faster, and 
partition it up with the biggest partitions being under 100 GiB, so even 
full unfiltered balances are normally under 10 minutes and normal 
filtered balances under a minute, to the point I usually issue the 
balance command and actually wait for completion, so it's a far different 
ball game than issuing a balance command on a multi-TB hard drive and 
expecting it to take hours or even days.  In that case, yeah, a 50% cap 
arguably makes sense, tho he was using 60, which still shouldn't (sans 
bugs like we seem to have here) be /too/ bad.
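
For reference, the sort of filtered balance being discussed looks 
something like this (mountpoint hypothetical, percentages just examples):

  # a 50% cap, the traditional advice for big, slow hard-drive filesystems
  btrfs balance start -dusage=50 /mnt/fs

  # a higher cap, as with the usage=70 mentioned above
  btrfs balance start -dusage=70 /mnt/fs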

---
[1] usage=100: -musage=1..100 is the only way I've found to balance 
metadata without rebalancing system as well, with the unfortunate penalty 
for rebalancing system on small filesystems being an increase of the 
system chunk size from the 8 MiB original mkfs.btrfs size to 32 MiB... only a 
few KiB used! =:^(

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID56

2018-06-20 Thread Duncan
fixing that?

As the above should make clear, it's _not_ a question as simple as 
"interest"!

> I think it's the real missing part for a feature-complete filesystem.
> Nowadays parity raid is mandatory, we can't only rely on mirroring.

"Nowdays"?  "Mandatory"?

Parity-raid is certainly nice, but mandatory, especially when there's 
already other parity solutions (both hardware and software) available 
that btrfs can be run on top of, should a parity-raid solution be /that/ 
necessary?  Of course btrfs isn't the only next-gen fs out there, either, 
there's other solutions such as zfs available too, if btrfs doesn't have 
the features required at the maturity required.

So I'd like to see the supporting argument to parity-raid being mandatory 
for btrfs, first, before I'll take it as a given.  Nice, sure.  
Mandatory?  Call me skeptical.

---
[1] "Still cautious" use:  In addition to the raid56-specific reliability 
issues described above, as well as to cover Waxhead's referral to my 
usual backups advice:

Sysadmin's[2] first rule of data value and backups:  The real value of 
your data is not defined by any arbitrary claims, but rather by how many 
backups you consider it worth having of that data.  No backups simply 
defines the data as of such trivial value that it's worth less than the 
time/trouble/resources necessary to do and have at least one level of 
backup.

With such a definition, data loss can never be a big deal, because even 
in the event of data loss, what was defined as of most importance, the 
time/trouble/resources necessary to have a backup (or at least one more 
level of backup, in the event there were backups but they failed too), 
was saved.  So regardless of whether the data was recoverable or not, you 
*ALWAYS* save what you defined as most important, either the data if you 
had a backup to retrieve it from, or the time/trouble/resources necessary 
to make that backup, if you didn't have it because saving that time/
trouble/resources was considered more important than making that backup.

Of course the sysadmin's second rule of backups is that it's not a 
backup, merely a potential backup, until you've tested that you can 
actually recover the data from it in similar conditions to those under 
which you'd need to recover it.  IOW, boot to the backup or to the 
recovery environment, and be sure the backup's actually readable and can 
be recovered from using only the resources available in the recovery 
environment, then reboot back to the normal or recovered environment and 
be sure that what you recovered from the recovery environment is actually 
bootable or readable in the normal environment.  Once that's done, THEN 
it can be considered a real backup.

"Still cautious use" is simply ensuring that you're following the above 
rules, as any good admin will be regardless, and that those backups are 
actually available and recoverable in a timely manner should that be 
necessary.  IOW, an only backup "to the cloud" that's going to take a 
week to download and recover to, isn't "still cautious use", if you can 
only afford a few hours of down time.  Unfortunately, that's a real-life 
scenario I've seen people here say they're in more than once.

[2] Sysadmin:  As used here, "sysadmin" simply refers to the person who 
has the choice of btrfs, as compared to say ext4, in the first place, 
that is, the literal admin of at least one system, regardless of whether 
that's administering just their own single personal system, or thousands 
of systems across dozens of locations in some large corporation or 
government institution.

[3] Raid56 mode reliability implications:  For raid56 data, this isn't 
/that/ big of a deal, tho depending on what's in the rest of the stripe, 
it could still affect files not otherwise written in some time.  For 
metadata, however, it's a huge deal, since an incorrectly reconstructed 
metadata stripe could take out much or all of the filesystem, depending 
on what metadata was actually in that stripe.  This is where waxhead's 
recommendation to use raid1/10 for metadata even if using raid56 for data 
comes in.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-08 Thread Duncan
Marc Lehmann posted on Wed, 06 Jun 2018 21:06:35 +0200 as excerpted:

> Not sure what exactly you mean with btrfs mirroring (there are many
> btrfs features this could refer to), but the closest thing to that that
> I use is dup for metadata (which is always checksummed), data is always
> single. All btrfs filesystems are on lvm (not mirrored), and most (but
> not all) are encrypted. One affected fs is on a hardware raid
> controller, one is on an ssd. I have a single btrfs fs in that box with
> raid1 for metadata, as an experiment, but I haven't used it for testing
> yet.

On the off chance, tho it doesn't sound like it from your description...

You're not doing LVM snapshots of the volumes with btrfs on them, 
correct?  Because btrfs depends on filesystem GUIDs being just that, 
globally unique, using them to find the possible multiple devices of a 
multi-device btrfs (normal single-device filesystems don't have the issue 
as they don't have to deal with multi-device as btrfs does), and btrfs 
can get very confused, with data-loss potential, if it sees multiple 
copies of a device with the same filesystem GUID, as can happen if lvm 
snapshots (which obviously have the same filesystem GUID as the original) 
are taken and both the snapshot and the source are exposed to btrfs 
device scan (which is auto-triggered by udev when the new device 
appears), with one of them mounted.

Presumably you'd consider lvm snapshotting a form of mirroring and you've 
already said you're not doing that in any form, but just in case, because 
this is a rather obscure trap people using lvm could find themselves in, 
without a clue as to the danger, and the resulting symptoms could be 
rather hard to troubleshoot if this possibility wasn't considered.
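
If in doubt, a quick way to check for the duplicate-GUID situation (the 
commands are real, the output of course machine-specific):

  # list every block device claiming to be btrfs, with its filesystem UUID;
  # two lines sharing a UUID (e.g. an lvm snapshot plus its origin) is the
  # dangerous case described above
  blkid -t TYPE=btrfs

  # and what btrfs itself currently associates with each filesystem
  btrfs filesystem show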

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID-1 refuses to balance large drive

2018-05-28 Thread Duncan
Brad Templeton posted on Sun, 27 May 2018 11:22:07 -0700 as excerpted:

> BTW, I decided to follow the original double replace strategy suggested --
> replace 6TB with 8TB and replace 4TB with 6TB.  That should be sure to
> leave the 2 large drives each with 2TB free once expanded, and thus able
> to fully use all space.
> 
> However, the first one has been going for 9 hours and is "189.7% done" 
> and still going.   Some sort of bug in calculating the completion
> status, obviously.  With luck 200% will be enough?

IIRC there was an over-100% completion status bug fixed, I'd guess about 
18 months to two years ago now, long enough that it would have slipped 
regulars' minds, so nobody would have thought of it even knowing you're 
still on 4.4, that being one of the reasons we don't do as well 
supporting stuff that old.

If it is indeed the same bug, anything even half modern should have it 
fixed.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID-1 refuses to balance large drive

2018-05-26 Thread Duncan
 in each of the current and LTS tracks.  So as the first
release back from current 4.16, 4.15, tho EOLed upstream, is still
reasonably supported for the moment here, tho people should be
upgrading to 4.16 by now as 4.17 should be out in a couple weeks or
so and 4.15 would be out of the two-current-kernel-series window at that
time.

Meanwhile, the two latest LTS series are as already stated 4.14, and the
earlier 4.9.  4.4 is the one previous to that and it's still mainline
supported in general, but it's out of the two LTS-series window of best
support here, and truth be told, based on history, even supporting the
second newest LTS series starts to get more difficult at about a year and
a half out, 6 months or so before the next LTS comes out.  As it happens
that's about where 4.9 is now, and 4.14 has had about 6 months to
stabilize now, so for LTS I'd definitely recommend 4.14, now.

Of course that doesn't mean that we /refuse/ to support 4.4, we still
try, but it's out of primary focus now and in many cases, should you
have problems, the first recommendation is going to be try something
newer and see if the problem goes away or presents differently.  Or
as mentioned, check with your distro if it's a distro kernel, since
in that case they're best positioned to support it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: csum failed root raveled during balance

2018-05-23 Thread Duncan
ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:

>> IMHO the best course of action would be to disable checksumming for your
>> vm files.
>> 
>> 
> Do you mean '-o nodatasum' mount flag? Is it possible to disable
> checksumming for a single file by setting some magical chattr? Google
> thinks it's not possible to disable csums for a single file.

You can use nocow (-C), but of course that has its own restrictions (it 
must be set on files while they're still zero-length, which for existing 
data is easiest done by setting it on the containing dir and copying the 
files in, without reflinking) as well as the usual nocow effects.  And 
nocow becomes cow1 after a snapshot: the snapshot locks the existing copy 
in place, so the first change to a block after the snapshot /must/ be 
written elsewhere (thus cow1, cow the first write after the snapshot), 
while repeated writes between snapshots retain the nocow behavior.

But if you're disabling checksumming anyway, nocow's likely the way to go.
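
A minimal sketch of the usual recipe, with hypothetical paths:

  mkdir /var/lib/vmimages
  chattr +C /var/lib/vmimages          # new files created here inherit nocow
  cp --reflink=never /old/path/disk.img /var/lib/vmimages/
  lsattr /var/lib/vmimages/disk.img    # should show the 'C' attribute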

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH] btrfs: property: Set incompat flag of lzo/zstd compression

2018-05-15 Thread Duncan
Su Yue posted on Tue, 15 May 2018 16:05:01 +0800 as excerpted:


> 
> On 05/15/2018 03:51 PM, Misono Tomohiro wrote:
>> Incompat flag of lzo/zstd compression should be set at:
>>  1. mount time (-o compress/compress-force)
>>  2. when defrag is done 3. when property is set
>> 
>> Currently 3. is missing and this commit adds this.
>> 
>> 
> If I don't misunderstand, the compression property of an inode only
> applies to *the* inode, not the whole filesystem.
> So the original logic should be okay.

But the inode is on the filesystem, and if it's compressed with lzo/zstd, 
the incompat flag should be set to avoid mounting with an earlier kernel 
that doesn't understand that compression and would therefore, if we're 
lucky, simply fail to read the data compressed in that file/inode.  (If 
we're unlucky it could blow up with kernel memory corruption like James 
Harvey's current case of unexpected, corrupted compressed data in a nocow 
file that being nocow, doesn't have csum validation to fail and abort the 
decompression, and shouldn't be compressed at all.)

So better to set the incompat flag and refuse to mount at all on kernels 
that don't have the required compression support.
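
For context, the property being discussed is the per-file one, e.g. (path 
hypothetical):

  btrfs property set /mnt/data/bigfile compression zstd
  btrfs property get /mnt/data/bigfile compression
  # per the patch under discussion, setting lzo/zstd this way would now
  # also set the filesystem-wide incompat flag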

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-11 Thread Duncan
Darrick J. Wong posted on Fri, 11 May 2018 17:06:34 -0700 as excerpted:

> On Fri, May 11, 2018 at 12:26:51PM -0700, Mark Fasheh wrote:
>> Right now we return EINVAL if a process does not have permission to dedupe a
>> file. This was an oversight on my part. EPERM gives a true description of
>> the nature of our error, and EINVAL is already used for the case that the
>> filesystem does not support dedupe.
>> 
>> Signed-off-by: Mark Fasheh <mfas...@suse.de>
>> ---
>>  fs/read_write.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 77986a2e2a3b..8edef43a182c 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -2038,7 +2038,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
>> file_dedupe_range *same)
>>  info->status = -EINVAL;
>>  } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
>>   uid_eq(current_fsuid(), dst->i_uid))) {
>> -info->status = -EINVAL;
>> +info->status = -EPERM;
> 
> Hmm, are we allowed to change this aspect of the kabi after the fact?
> 
> Granted, we're only trading one error code for another, but will the
> existing users of this care?  xfs_io won't and I assume duperemove won't
> either, but what about bees? :)

From the 0/2 cover-letter:

>>> This has also popped up in duperemove, mostly in the form of cryptic
>>> error messages. Because this is a code returned to userspace, I did
>>> check the other users of extent-same that I could find. Both 'bees'
>>> and 'rust-btrfs' do the same as duperemove and simply report the error
>>> (as they should).

> --D
> 
>>  } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>>  info->status = -EXDEV;
>>  } else if (S_ISDIR(dst->i_mode)) {
>> -- 
>> 2.15.1
>>

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID56 - 6 parity raid

2018-05-02 Thread Duncan
Goffredo Baroncelli posted on Wed, 02 May 2018 22:40:27 +0200 as
excerpted:

> Anyway, my "rant" started when Duncan put the missing parity checksum
> next to the write hole.  The first might be a performance problem.
> The write hole, instead, could lead to losing data.  My intention was to
> highlight that the parity-checksum is not related to the reliability and
> safety of raid5/6.

Thanks for making that point... and to everyone else for the vigorous 
thread debating it, as I'm learning quite a lot! =:^)

From your first reply:

>> Why the fact that the parity is not checksummed is a problem ?
>> I read several times that this is a problem. However each time the
>> thread reached the conclusion that... it is not a problem.

I must have missed those threads, or at least, missed that conclusion 
from them (maybe believing they were about something rather narrower, or 
conflating... for instance), because AFAICT, this is the first time I've 
seen the practical merits of checksummed parity actually debated, at 
least in terms I as a non-dev can reasonably understand.  To my mind it 
was settled (or I'd have worded my original claim rather differently) and 
only now am I learning different.

And... to my credit... given the healthy vigor of the debate, it seems 
I'm not the only one that missed them...

But I'm surely learning of it now, and indeed, I had somewhat conflated 
parity-checksumming with the in-place-stripe-read-modify-write atomicity 
issue.  I'll leave the parity-checksumming debate (now that I know it at 
least remains debatable) to those more knowledgeable than myself, but in 
addition to what I've learned of it, I've definitely learned that I can't 
properly conflate it with the in-place stripe-rmw atomicity issue, so 
thanks!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID56 - 6 parity raid

2018-05-02 Thread Duncan
Gandalf Corvotempesta posted on Wed, 02 May 2018 19:25:41 + as
excerpted:

> On 05/02/2018 03:47 AM, Duncan wrote:
>> Meanwhile, have you looked at zfs? Perhaps they have something like
>> that?
> 
> Yes, i've looked at ZFS and I'm using it on some servers but I don't
> like it too much for multiple reasons, in example:
> 
> 1) is not officially in kernel, we have to build a module every time
> with DKMS

FWIW zfs is excluded from my choice domain as well, due to the well known 
license issues.  Regardless of strict legal implications, because Oracle 
has copyrights they could easily solve that problem and the fact that 
they haven't strongly suggests they have no interest in doing so.  That 
in turn means they have no interest in people like me running zfs, which 
means I have no interest in it either.

But because it does remain effectively the nearest to btrfs features and 
potential features "working now" solution out there, for those who simply 
_must_ have it and/or find it a more acceptable solution than cobbling 
together a multi-layer solution out of a standard filesystem on top of 
device-mapper or whatever, it's what I and others point to when people 
wonder about missing or unstable btrfs features.

> I'm new to BTRFS (in fact, I'm not using it) and I've seen in the status
> page that "it's almost ready".
> The only real missing part is a stable, secure and properly working
> RAID56,
> so I'm wondering why most of the effort isn't directed at fixing RAID56?

Well, they are.  But finding and fixing corner-case bugs takes time and 
early-adopter deployments, and btrfs doesn't have the engineering 
resources to simply assign to the problem that Sun had with zfs.

Despite that, as I stated, the current btrfs raid56 code is, to the best 
of my/list knowledge, now reasonably ready, tho it'll take another year 
or two without serious bug reports to actually demonstrate that.  But it 
still has the well known write hole that applies to all parity-raid 
unless specific measures are taken, such as partial-stripe-write logging 
(slow), writing a full stripe even if it's partially empty (wastes space 
and needs periodic maintenance to reclaim it), or variable stripe widths 
(needs periodic maintenance and is more complex than always writing full 
stripes even if they're partially empty), the latter two avoiding the 
problem by avoiding the in-place read-modify-write cycle entirely.

So to a large degree what's left is simply time for testing to 
demonstrate stability on the one hand, and a well known problem with 
parity-raid in general on the other.  There's the small detail that said 
well-known write hole has additional implementation-detail implications 
on btrfs, but at its root it's the same problem all parity-raid has, and 
people choosing parity-raid as a solution are already choosing to either 
live with it or ameliorate it in some other way (tho some parity-raid 
solutions have that amelioration built-in).

> There are some environments where a RAID1/10 is too expensive and a
> RAID6 is mandatory,
> but with the current state of RAID56, BTRFS can't be used for valuable
> data

Not entirely true.  Btrfs, even btrfs raid56 mode, _can_ be used for 
"valuable" data, it simply requires astute /practical/ definitions of 
"valuable", as opposed to simple claims that don't actually stand up in 
practice.

Here's what I mean:  The sysadmin's first rule of backups defines 
"valuable data" by the number of backups it's worth making of that data.  
If there's no backups, then by definition the data is worth less than the 
time/hassle/resources necessary to have that backup, because it's not a 
question of if, but rather when, something's going to go wrong with the 
working copy and it won't be available any longer.

Additional layers of backup and whether one keeps geographically 
separated off-site backups as well are simply extensions of the first-
level-backup case/rule.  The more valuable the data, the more backups 
it's worth having of it, and the more effort is justified in ensuring 
that single or even multiple disasters aren't going to leave no working 
backup.

With this view, it's perfectly fine to use btrfs raid56 mode for 
"valuable" data, because that data is backed up and that backup can be 
used as a fallback if necessary.  True, the "working copy" might not be 
as reliable as it is in some cases, but statistically, that simply brings 
the 50% chance of failure rate (or whatever other percentage chance you 
choose) closer, to say once a year, or once a month, rather than perhaps 
once or twice a decade.  Working copy failure is GOING to happen in any 
case, it's just a matter of playing the chance game as to when, and using 
a not yet fully demonstrated reliable filesystem mode simply ups 
the chances a bit.

But if the data really *is* defined as "valuable"...

Re: RAID56 - 6 parity raid

2018-05-01 Thread Duncan
Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 + as
excerpted:

> Hi to all.  I've found some patches from Andrea Mazzoleni that add
> support for up to 6-parity raid.
> Why weren't these merged?
> With modern disk sizes, having something greater than 2 parity would be
> great.

1) Btrfs parity-raid was known to be seriously broken until quite 
recently (and still has the common parity-raid write-hole, which is more 
serious on btrfs because btrfs otherwise goes to some lengths to ensure 
data/metadata integrity via checksumming and verification, and the parity 
isn't checksummed, risking even old data due to the write hole, but there 
are a number of proposals to fix that), and piling even more not well 
tested patches on top was _not_ the way toward a solution.

2) Btrfs features in general have taken longer to merge and stabilize 
than one might expect, and parity-raid has been a prime example, with the 
original roadmap calling for parity-raid merge back in the 3.5 timeframe 
or so... partial/runtime (not full recovery) code was finally merged ~3 
years later in (IIRC) 3.19, took several development cycles for the 
initial critical bugs to be worked out but by 4.2 or so was starting to 
look good, then more bugs were found and reported, that took several more 
years to fix, tho IIRC LTS-4.14 has them.

Meanwhile, consider that N-way-mirroring was fast-path roadmapped for 
"right after raid56 mode" (because some of its code depends on that), so 
it was originally expected in 3.6 or so...  As someone who had been 
wanting to use /that/, I personally know the pain of "still waiting".

And that was "fast-pathed".

So even if the multi-way-parity patches were on the "fast" path, it's 
only "now" (for relative values of now, for argument say by 4.20/5.0 or 
whatever it ends up being called) that such a thing could be reasonably 
considered.


3) AFAIK none of the btrfs devs have flat rejected the idea, but btrfs 
remains development opportunity rich and implementing dev poor... there's 
likely 20 years or more of "good" ideas out there.  And the N-way-parity-
raid patches haven't hit any of the current devs' (or their employers') 
"personal itch that needs to be scratched" interest points, so while it 
certainly does remain a "nice idea", given the implementation timeline 
history for even "fast-pathed" ideas, realistically we're looking at at 
least a decade out.  But with the practical projection horizon no more 
than 5-7 years out (beyond that, other, unpredicted developments are 
likely to change things so much that projection is effectively 
impossible), in practice, a decade out is "bluesky", aka "it'd be nice to 
have someday, but it's not a priority, and with current developer 
manpower, it's unlikely to happen any time in the practically projectable 
future."

4) Of course all that's subject to no major new btrfs developer (or 
sponsor) making it a high priority, but even should such a developer (and/
or sponsor) appear, they'd probably need to spend at least two years 
coming up to speed with the code first, fixing normal bugs and improving 
the existing code quality, then post the updated and rebased N-way-parity 
patches for discussion, and get them roadmapped for merge probably some 
years later due to other then-current project feature dependencies.

So even if the N-way-parity patches became some new developer's (or 
sponsor's) personal itch to scratch, by the time they came up to speed 
and the code was actually merged, there's no realistic projection that it 
would be in under 5 years, plus another couple to stabilize, so at least 
7 years to properly usable stability.  So even then, we're already at the 
5-7 years practical projectability limit.


Meanwhile, have you looked at zfs?  Perhaps they have something like 
that?  And there's also a new(?) one, stratis, AFAIK commercially 
sponsored and device-mapper based, that I saw an article on recently, tho 
I've seen/heard no kernel-community discussion on it (there's a good 
chance followup here will change that if it's worth discussing, as 
there's several folks here for whom knowing about such things is part of 
their job) and no other articles (besides the pt 1 of the series 
mentioned below), so for all I know it's pie-in-the-sky or still new 
enough it'd be 5-7 years before it can be used in practice, as well.  But 
assuming it's a viable project, presumably it would get such support if/
when device-mapper does.

The stratis article I saw (apparently part 2 in a series):
https://opensource.com/article/18/4/stratis-lessons-learned

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: NVMe SSD + compression - benchmarking

2018-04-29 Thread Duncan
Brendan Hide posted on Sat, 28 Apr 2018 09:30:30 +0200 as excerpted:

> My real worry is that I'm currently reading at 2.79GB/s (see result
> above and below) without compression when my hardware *should* limit it
> to 2.0GB/s. This tells me either `sync` is not working or my benchmark
> method is flawed.

No answer but a couple additional questions/suggestions:

* Tarfile:  Just to be sure, you're using an uncompressed tarfile, not a 
(compressed tarfile) tgz/tbz2/etc, correct?

* How does hdparm -t and -T compare?  That's read-only and bypasses the 
filesystem, so it should at least give you something to compare the 2.79 
GB/s to, both from-raw-device (-t) and cached/memory-only (-T).  See the 
hdparm (8) manpage for the details.  (Example invocations after this list.)

* And of course try the compressed tarball too, since it should be easy 
enough and should give you compressible vs. incompressible numbers for 
sanity checking.
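
The hdparm comparison would be something like this (device path 
hypothetical):

  hdparm -t /dev/nvme0n1    # buffered device reads, bypassing the filesystem
  hdparm -T /dev/nvme0n1    # cached reads, effectively a memory-bandwidth check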

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: What is recommended level of btrfs-progs and kernel please

2018-04-29 Thread Duncan
David C. Partridge posted on Sat, 28 Apr 2018 15:09:07 +0100 as excerpted:

> To what level of btrfs-progs do you recommend I should upgrade once my
> corrupt FS is fixed?  What is the kernel pre-req for that?
> 
> Would prefer not to build from source ... currently running Ubuntu
> 16.04LTS

The way it works is as follows:

In normal operation, the kernel does most of the work, with commands such 
as balance and scrub simply making the appropriate calls to the kernel to 
do the real work.  So the kernel version is what's critical in normal 
operation.  (IIRC, the receive side of btrfs send/receive is an 
exception, userspace is doing the work there, tho the kernel does it on 
the send side.)

This list is mainline and forward-looking development focused, so 
recommended kernels, the ones people here are most familiar with, tend to 
be relatively new.  The two support tracks are current and LTS, and we 
try to support the latest two kernels of each.  On the current kernel 
track, 4.16 is the latest, so the 4.16 and 4.15 series are currently 
supported.  On the LTS track, 4.14 is the newest LTS series and is 
recommended, with 4.9 the previous one, still supported, tho as it gets 
older and memories of what was going on at the time fade, it gets harder 
to support.

That doesn't mean we don't try to help people with older kernels, but 
truth is, the best answer may well be "try it with a newer kernel and see 
if the problem persists".

Similarly for distro kernels, particularly older ones.  We track mainline 
and in general[1] have little idea what patches specific distros may have 
backported... or not.  With newer kernels there's not so much to backport, 
and hopefully none of their added patches actually interferes, but 
particularly outside the mainline LTS series kernels, and older than the 
second newest LTS series kernel for the real LTS distros, the distros 
themselves are choosing what to backport and support, and thus are in a 
better position to support those kernels than we on this list will be.


But when something goes wrong and you need to use the debugging tools or 
btrfs check or restore, it's the btrfs userspace (btrfs-progs) that is 
doing the work, so it becomes the most critical when you have a problem 
you are trying to find/repair/restore-from.

So in normal operation, userspace isn't critical, and the biggest problem 
is simply keeping it current enough that the output remains comparable to 
current output.  With btrfs userspace release numbering following that of 
the kernel, for operational use, a good rule of thumb is to keep 
userspace updated to at least the version of the oldest supported LTS 
kernel series, as mentioned 4.9 at present, thus keeping it at least 
within approximately two years of current.

But once something goes wrong, the newest available userspace, or close 
to it, has the latest fixes, and generally provides the best chance at a 
fix with the least hassle or chance of further breakage instead.  So 
there, basically something within the current track, above, thus 
currently at least a 4.15 if not a 4.16 userspace (btrfs-progs) is your 
best bet.

And often the easiest way to get that if your distro doesn't make it 
directly available, is to make it a point to keep around the latest 
LiveRescue (often install/rescue combined) image of a distro such as 
Fedora or Arch that stays relatively current.  That's often the newest or 
close enough, and if it's not, it at least gives you a way to get back 
online to fetch something newer after booting the rescue image, if you 
have to.

---
[1] In general:  I think one regular btrfs dev works with SuSE, and one 
non-dev but well-practiced support list regular is most familiar with 
Fedora, tho of course Fedora doesn't tend to be /too/ outdated.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: status page

2018-04-25 Thread Duncan
ch, that logged-write penalty on top of the read-modify-
write penalty that short-stripe-writes on parity-raid already incurs, 
will really do a number on performance!  But it /should/ finally fix the 
write hole risk, and it'd be the fastest way to do it on top of existing 
code, with the least risk of additional bugs because it's the least new 
code to write.


What I personally suspect will happen is this last solution in the 
shorter term, tho it'll still take some years to be written and tested to 
stability, with the possibility of someone undertaking a btrfs parity-
raid-g2 project implementing the first/cleanest possibility in the longer 
term, say a decade out (which effectively means "whenever someone with 
the skills and motivation decides to try it, could be 5 years out if they 
start today and devote the time to it, could be 15 years out, or never, 
if nobody ever decides to do it).  I honestly don't see the intermediate 
possibilities as worth the trouble, as they'd take too long for not 
enough payback compared to the solutions at either end, but of course, 
someone might just come along that likes and actually implements that 
angle instead.  As always with FLOSS, the one actually doing the 
implementation is the one who decides (subject to maintainer veto, of 
course, and possible distro and ultimate mainlining of the de facto 
situation override of the maintainer, as well).


A single paragraph summary answer?

Current raid56 status-quo is semi-stable, and subject to testing over 
time, is likely to remain there for some time, with the known parity-raid 
write-hole caveat as the biggest issue.  There's discussion of attempts 
to mitigate the write-hole, but the final form such mitigation will take 
remains to be settled, and the shortest-to-stability alternative, logged 
partial-stripe-writes, has serious performance negatives, but that might 
be acceptable given that parity-raid already has read-modify-write 
performance issues so people don't choose it for write performance in any 
case.  That'd be probably 3 years out to stability at the earliest.  
There's a cleaner alternative but it'd be /much/ farther out as it'd 
involve a pretty heavy rewrite along with the long testing and bugfix 
cycle that implies, so ~10 years out if ever, for that.  And there's a 
couple intermediate alternatives as well, but unless something changes I 
don't really see them going anywhere.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Btrfs progs release 4.16.1

2018-04-25 Thread Duncan
David Sterba posted on Wed, 25 Apr 2018 13:02:34 +0200 as excerpted:

> On Wed, Apr 25, 2018 at 06:31:20AM +0000, Duncan wrote:
>> David Sterba posted on Tue, 24 Apr 2018 13:58:57 +0200 as excerpted:
>> 
>> > btrfs-progs version 4.16.1 have been released.  This is a bugfix
>> > release.
>> > 
>> > Changes:
>> > 
>> >   * remove obsolete tools: btrfs-debug-tree, btrfs-zero-log,
>> >   btrfs-show-super, btrfs-calc-size
>> 
>> Cue the admin-side gripes about developer definitions of micro-upgrade
>> explicit "bugfix release" that allow disappearance of "obsolete tools".
>> 
>> Arguably such removals can be expected in a "feature release", but
>> shouldn't surprise unsuspecting admins doing a micro-version upgrade
>> that's specifically billed as a "bugfix release".
> 
> A major version release would be a better time for the removal, I agree
> and should have considered that.
> 
> However, the tools have been obsoleted for a long time (since 2015 or
> 2016) so I wonder if the deprecation warnings have been ignored by the
> admins all the time.

Indeed, in practice, anybody still using the stand-alone tools in a 
current version has been ignoring deprecation warnings for awhile, and 
the difference between 4.16.1 and 4.17(.0) isn't likely to make much of a 
difference to them.

It's just that from here anyway, if I did a big multi-version upgrade and 
saw tools go missing I'd expect it, and if I did an upgrade from 4.16 to 
4.17 I'd expect it and blame myself for not getting with the program 
sooner.  But on an upgrade from 4.16 to 4.16.1, furthermore, an explicit 
"bugfix release", I'd be annoyed with upstream when they went missing, 
because it's just not expected in such a minor release, particularly when 
it's an explicit "bugfix release".

>> (Further support for btrfs being "still stabilizing, not yet fully
>> stable and mature."  But development mode habits need to end
>> /sometime/, if stability is indeed a goal.)
> 
> What happened here was a bad release management decision, a minor one in
> my oppinion but I hear your complaint and will keep that in mind for
> future releases.

That's all I was after.  A mere trifle indeed in the filesystem context 
where there's a real chance that bugs can eat data, but equally trivially 
held off for a .0 release.  What's behind is done, but it can and should 
be used to inform the future, and I simply mentioned it here with the 
goal /of/ informing future release decisions.  To the extent that it does 
so, my post accomplished its purpose. =:^)

Seems my way of saying that ended up coming across way more negative than 
intended.  So I have some changes to make in the way I handle things in 
the future as well. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Btrfs progs release 4.16.1

2018-04-25 Thread Duncan
David Sterba posted on Tue, 24 Apr 2018 13:58:57 +0200 as excerpted:

> btrfs-progs version 4.16.1 have been released.  This is a bugfix
> release.
> 
> Changes:
> 
>   * remove obsolete tools: btrfs-debug-tree, btrfs-zero-log,
>   btrfs-show-super, btrfs-calc-size

Cue the admin-side gripes about developer definitions of micro-upgrade 
explicit "bugfix release" that allow disappearance of "obsolete tools".

Arguably such removals can be expected in a "feature release", but 
shouldn't surprise unsuspecting admins doing a micro-version upgrade 
that's specifically billed as a "bugfix release".

(Further support for btrfs being "still stabilizing, not yet fully stable 
and mature."  But development mode habits need to end /sometime/, if 
stability is indeed a goal.) 

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Recovery from full metadata with all device space consumed?

2018-04-20 Thread Duncan
>> # btrfs fi show /broken
>> Label: 'mon_data'  uuid: 85e52555-7d6d-4346-8b37-8278447eb590
>> Total devices 4 FS bytes used 69.50GiB
>> devid1 size 931.51GiB used 931.51GiB path /dev/sda1
>> devid2 size 931.51GiB used 931.51GiB path /dev/sdb1
>> devid3 size 931.51GiB used 931.51GiB path /dev/sdc1
>> devid4 size 931.51GiB used 931.51GiB path /dev/sdd1

As you suggest, all space on all devices is used.  While fi usage breaks 
out unallocated as its own line-item, both per device and overall, with
fi show/df you have to derive it from the difference between size and 
used on each device listed in the fi show report.

If (after getting it that way with balance) you keep fi show per-device 
used under say 250 or 500 GiB, the rest stays unallocated, as fi usage 
will make clearer.

Meanwhile, for fi df, that data line says 3.6+ TiB total data chunk 
allocations, but only 67 GiB used.  As I said, that's ***WAY*** out of 
whack, and getting it back into something a bit more normal and keeping 
it there, for under 100 GiB actually used, say under 250 or 500 GiB 
total, with the rest returned to unallocated, dropping the data total in 
the fi df report and increasing unallocated in fi usage, should keep you 
well 
out of trouble.

As for fi usage, While I use a bunch of much smaller filesystems here, 
all raid1 or dup, so it'll be of limited direct help, I'll post the 
output from one of mine, just so you can see how much easier it is to 
read the fi usage report:

$$ sudo btrfs filesystem usage /
Overall:
Device size:  16.00GiB
Device allocated:  7.02GiB
Device unallocated:8.98GiB
Device missing:  0.00B
Used:  4.90GiB
Free (estimated):  5.25GiB  (min: 5.25GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,RAID1: Size:3.00GiB, Used:2.24GiB
   /dev/sda5   3.00GiB
   /dev/sdb5   3.00GiB

Metadata,RAID1: Size:512.00MiB, Used:209.59MiB
   /dev/sda5 512.00MiB
   /dev/sdb5 512.00MiB

System,RAID1: Size:8.00MiB, Used:16.00KiB
   /dev/sda5   8.00MiB
   /dev/sdb5   8.00MiB

Unallocated:
   /dev/sda5   4.49GiB
   /dev/sdb5   4.49GiB

(FWIW there's also btrfs device usage, if you want a device-focused 
report.)

This is a btrfs raid1 both data and metadata, on a pair of 8 GiB devices, 
thus 16 GiB total.

Of that 8 GiB per device, a very healthy 4.49 GiB per device, over half 
the filesystem, remains entirely chunk-level unallocated and thus free to 
allocate to data or metadata chunks as needed.

Meanwhile, data chunk allocation is 3 GiB total per device, of which 2.24 
GiB is used.  Again, that's healthy, as data chunks are nominally 1 GiB 
so that's probably three 1 GiB chunks allocated, with 2.24 GiB of it used.

By contrast, your in-trouble fi usage report will show (near) 0 
unallocated and a ***HUGE*** gap between size/total and used for data, 
while you should be easily able to get per-device data totals down to say 
250 GiB or so (or down to 10 GiB or so with more work), with it all 
switching to unallocated, and then keep it healthy by doing a balance 
with -dusage= as necessary any time the numbers start getting out of line 
again.
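
In practice that maintenance balance is something like the below, 
starting with a low usage filter (cheap, and it frees the emptiest chunks 
first, which matters when there's no unallocated space left) and raising 
it only as needed -- the mountpoint here is the /broken from your paste:

  btrfs balance start -dusage=0 /broken    # drops completely empty chunks
  btrfs balance start -dusage=10 /broken
  btrfs balance start -dusage=50 /broken
  btrfs filesystem usage /broken           # check unallocated between runs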

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: remounted ro during operation, unmountable since

2018-04-15 Thread Duncan
Qu Wenruo posted on Sat, 14 Apr 2018 22:41:50 +0800 as excerpted:

>> sectorsize        4096
>> nodesize        4096
> 
> Nodesize is not the default 16K, any reason for this?
> (Maybe performance?)
> 
>>> 3) Extra hardware info about your sda
>>>     Things like SMART and hardware model would also help here.

>> Model Family: Samsung based SSDs Device Model: SAMSUNG SSD 830
>> Series
> 
> At least I haven't hear much problem about Samsung SSD, so I don't think
> it's the hardware to blamce. (Unlike Intel 600P)

830 model is a few years old, IIRC (I have 850s, and I think I saw 860s 
out in something I read probably on this list, but am not sure of it).  I 
suspect the filesystem was created with an old enough btrfs-tools that 
the default nodesize was still 4K, either due to older distro, or simply 
due to using the filesystem that long.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs fails to mount after power outage

2018-04-12 Thread Duncan
Qu Wenruo posted on Thu, 12 Apr 2018 07:25:15 +0800 as excerpted:


> On 2018年04月11日 23:33, Tom Vincent wrote:
>> My btrfs laptop had a power outage and failed to boot with "parent
>> transid verify failed..." errors. (I have backups).
> 
> Metadata corruption, again.
> 
> I'm curious about what's the underlying disk?
> Is it plain physical device? Or have other layers like bcache/lvm?
> 
> And what's the physical device? SSD or HDD?

The last line of his message said progs 4.15, kernel 4.15.15, NVMe, so 
it's SSD.

Another important question, tho, if not for this instance, than for 
easiest repair the next time something goes wrong:

What mount options?  In particular, is the discard option used (and of 
course I'm assuming nothing as insane as nobarrier)?

Because as came up on a recent thread here...

Btrfs normally keeps a few generations of root blocks around and one 
method of recovery is using the usebackuproot (or the deprecated 
recovery) option to try to use them if the current root is bad.  But 
apparently nobody considered how discard and the backup roots would 
interact, and there's (currently) nothing keeping them from being marked 
for discard just as soon as the next new root becomes current.  Now some 
device firmware batches up discards as garbage-collection that can be 
done periodically, when the number of unwritten erase-blocks gets low, 
but others do discards basically immediately, meaning those backup roots 
are lost effectively immediately, making the usebackuproot recovery 
feature worthless. =:^(

Not a tradeoff that would occur to most people considering whether to 
enable discard or not, obviously including the btrfs devs who set up the 
btrfs discard behavior. =:^(

But it's definitely a tradeoff to consider once you /do/ know it!
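
For anyone wanting to check their own setup, something like this should 
do it (device and mountpoint names are placeholders):

  # is the filesystem mounted with the discard option?
  findmnt -no OPTIONS /mountpoint | tr ',' '\n' | grep discard

  # read-only recovery attempt using the backup roots
  mount -o ro,usebackuproot /dev/sdX /mnt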

Presumably that'll be fixed at some point, but not being a dev nor 
knowing how complex the fix might be, I won't venture a guess as to when, 
or whether it'd be considered stable-kernel backport material or not, 
when it happens.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Out of space and incorrect size reported

2018-03-22 Thread Duncan
Shane Walton posted on Thu, 22 Mar 2018 00:56:05 + as excerpted:

>>>> btrfs fi df /mnt2/pool_homes
>>> Data, RAID1: total=240.00GiB, used=239.78GiB
>>> System, RAID1: total=8.00MiB, used=64.00KiB
>>> Metadata, RAID1: total=8.00GiB, used=5.90GiB
>>> GlobalReserve, single: total=512.00MiB, used=59.31MiB
>>> 
>>>> btrfs filesystem show /mnt2/pool_homes
>>> Label: 'pool_homes'  uuid: 0987930f-8c9c-49cc-985e-de6383863070
>>> Total devices 2 FS bytes used 245.75GiB
>>> devid    1 size 465.76GiB used 248.01GiB path /dev/sda
>>> devid    2 size 465.76GiB used 248.01GiB path /dev/sdb
>>> 
>>> Why is the line above "Data, RAID1: total=240.00GiB, used=239.78GiB"
>>> almost full and limited to 240 GiB when I have 2x 500 GB HDDs?

>>> What can I do to make this larger or closer to the full size of 465
>>> GiB (minus the System and Metadata overhead)?

By my read, Hugo answered correctly, but (I think) not the question you 
asked.

The upgrade was certainly a good idea, 4.4 being quite old now and not 
well supported here any longer, as this is a development list and we tend 
to focus on current code rather than ancient history.  But it didn't 
change the report output as you expected, because based on your question 
you're misreading it: it doesn't say what you're interpreting it as 
saying.

BTW, you might like the output from btrfs filesystem usage a bit better, 
as it's somewhat clearer than the previously required pair of btrfs fi df 
and btrfs fi show (usage is a relatively new subcommand that might not 
have been in 4.4 yet), but understanding how btrfs works and what the 
reported numbers mean is still useful.

Btrfs does two-stage allocation.  First, it allocates chunks of a 
specific type, normally data or metadata (system is special, normally 
only one chunk so no more allocated, and global reserve is actually 
reserved from metadata and counts as part of it) from unused/unallocated 
space (which isn't shown by show/df, but usage shows it separately), then 
when necessary, btrfs actually uses space from the chunks it allocated 
previously.

So what the above df line is saying is that 240 GiB of space have been 
allocated as data chunks, and 239.78 GiB of that, almost all of it, is 
used.

But you should still have 200+ GiB of unallocated space on each of the 
devices, as here shown by the individual device lines of the show command 
(465 total, 248 used), tho as I said, btrfs filesystem usage makes that 
rather clearer.
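
(That's simply, using the mountpoint from your post:

  btrfs filesystem usage /mnt2/pool_homes

with the overall "Device unallocated" line and the per-device unallocated 
figures being the ones to watch.)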

And btrfs should normally allocate additional space from that 200+ gigs 
unallocated, to data or metadata chunks, as necessary.  Further, because 
btrfs can't directly take chunks allocated as data and reallocate them as 
metadata, you *WANT* lots of unallocated space.  You do NOT want all that 
extra space allocated as data chunks, because then they wouldn't be 
available to allocate as metadata if needed.

Now with 200+ GiB of space on each of the two devices unallocated, you 
shouldn't yet be running into ENOSPC (error no space) errors.  If you 
are, that's a bug, and there have actually been a couple bugs like that 
recently, but that doesn't mean you want btrfs to unnecessarily allocate 
all that unallocated space as data space, which would be what it did if 
it reported all that as data.  Rather, you need btrfs to allocate data, 
and metadata, chunks as needed, and any space related errors you are 
seeing would be bugs related to that.

Now that you have a newer btrfs-progs and kernel, and have read my 
attempt at an explanation above, try btrfs filesystem usage and see if 
things are clearer.  If not, maybe Hugo or someone else can do better 
now, answering /that/ question.  And of course if with the newer 4.12 
kernel you're getting ENOSPC errors, please report that too, tho be aware 
that 4.14 is the latest LTS series, with 4.9 the LTS before that, and as 
a normal non-LTS series kernel 4.12 support has ended as well, so you 
might wish to either upgrade to a current 4.14 LTS or downgrade to the 
older 4.9 LTS, for best support.

Or of course you could go with a current non-LTS.  Normally the latest 
two release series on both the normal and LTS tracks are best supported, 
so right now that means 4.15 or 4.14 on the normal track (becoming 4.16 
and 4.15 once 4.16 is released), or the previously mentioned 4.14 and 4.9 
series on the LTS track, tho at over a year old 4.9 is already getting 
rather harder to support, and 4.14 is now mature enough that it's the 
preferred LTS choice.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

Re: grub_probe/grub-mkimage does not find all drives in BTRFS RAID1

2018-03-22 Thread Duncan
kept it!

But in addition to two-way raid1 redundancy on multiple devices, btrfs 
has the dup mode, two-way dup redundancy on a single device, so that's 
what I do with my /boot and its backups on other devices now, instead of 
making them raid1s across multiple devices.

So while most of my filesystems and their backups are btrfs raid1 both 
data and metadata across two physical devices (with another pair of 
physical devices for the btrfs raid1 backups), /boot and its backups are 
all btrfs dup mixed-bg-mode (so data and metadata mixed, easier to work 
with on small filesystems), giving me one primary /boot and three 
backups, and I can still select which one to boot from the hardware/BIOS 
(legacy not EFI mode, tho I do use GPT and have EFI-boot partitions 
reserved in case I decide to switch to EFI at some point).


So my suggestion would be to do something similar, multiple /boot, one 
per device, one as the working copy and the other(s) as backups, instead 
of btrfs raid1 across multiple devices.  If you still want to take 
advantage of btrfs' ability to error-correct from a second copy if the 
first fails checksum, as I do, btrfs dup mode is useful, but regardless, 
you'll then have a backup in case the working /boot entirely fails.  Tho 
of course with dup mode you can only use a bit under half the capacity.

Your btrfs fi show says 342 MiB used (as data) of the 1 GiB, so dup mode 
should be possible as you'd have a bit under 500 MiB capacity then.  Your 
individual devices say nearly 700 MiB each used, but with only 342 MiB of 
that as data, the rest is likely partially used chunks that a filtered 
balance can take care of.  A btrfs fi usage report would tell the details 
(or btrfs fi df, combined with the show you've already posted).  At a 
GiB, creating the filesystem as mixed-mode is also recommended, tho that 
does make a filtered balance a bit more of a hassle since you have to use 
the same filters for both data and metadata because they're the same 
chunks.
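
(A sketch of such a mkfs, with the partition name a placeholder:

  mkfs.btrfs --mixed --data dup --metadata dup --label boot /dev/sdXN

Mixed-bg mode plus dup for both data and metadata, matching the 
description above.)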

FWIW, I started out with 256 MiB /boot here, btrfs dup mode so ~ 100 MiB 
usable, but after ssd upgrades and redoing the layout, now use 512 MiB 
/boots, for 200+ MiB usable.  That's better.  Your 1 GiB doubles that, so 
should be no trouble at all, even with dup, unless you're storing way 
more in /boot than I do.  (Being gentoo I do configure and run a rather 
slimmer custom initramfs and monolithic kernel configured for only the 
hardware and dracut initr* modules I need, and a fatter generic initr* 
and kernel modules would likely need more space, but your show output 
says it's only using 342 MiB for data, so as I said your 1 GiB for ~500 
MiB usable in dup mode should be quite reasonable.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Raid1 volume stuck as read-only: How to dump, recreate and restore its content?

2018-03-15 Thread Duncan
Piotr Pawłow posted on Tue, 13 Mar 2018 08:08:27 +0100 as excerpted:

> Hello,
>> Put differently, 4.7 is missing a year and a half worth of bugfixes
>> that you won't have when you run it to try to check or recover that
>> btrfs that won't mount! Do you *really* want to risk your data on bugs
>> that were after all discovered and fixed over a year ago?
> 
> It is also missing newly introduced bugs. Right now I'm dealing with
> btrfs raid1 server that had the fs getting stuck and kernel oopses due
> to a regression:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=198861
> 
> I had to cherry-pick commit 3be8828fc507cdafe7040a3dcf361a2bcd8e305b and
> recompile the kernel to even start moving the data off the failing
> drive, as the fix is not in stable yet, and encountering any i/o error
> would break the kernel. And now it seems the fs is corrupted, maybe due
> to all the crashes earlier.
> 
> FYI in case you decide to switch to 4.15

In context I was referring to userspace as the 4.7 was userspace btrfs-
progs, not kernelspace.

For kernelspace he was on 4.9, which is the second-newest LTS (long-term-
stable) kernel series, and thus should continue to be at least somewhat 
supported on this list for another year or so, as we try to support the 
two newest kernels from both the current and LTS series.  Tho 4.9 does 
lack the newer raid1 per-chunk degraded-writable scanning feature, and 
AFAIK that won't be stable-backported as it's more a feature than a bugfix 
and as such, doesn't meet the requirements for stable-series backports.  
Which is why Adam recommended a newer kernel, since that was the 
particular problem needing addressed here.

But for someone on an older kernel, presumably because they like 
stability, I'd suggest the newer 4.14 LTS series kernel as an upgrade, 
not the 4.15 series, which is only short-term supported... unless the 
intent is to continue staying current after that, with 4.16, 4.17, etc.  
Your point about newer kernels coming with newer bugs in addition to 
fixes supports that as well.  Moving to the 4.14 LTS should get the real 
fixes and the longer stabilization time, without the feature adds that 
would bring a higher chance of new bugs along with them.

And with 4.15 out for awhile now and 4.16 close, 4.14 should be 
reasonably stabilizing by now and should be pretty safe to move to.

Of course there's some risk of new bugs in addition to fixes for newer 
userspace versions too.  But kernelspace is the operational code while 
userspace is primarily for recovery, we know that older bugs ARE fixed in 
newer userspace, and a sane backups policy is assumed, as I stressed in 
the same post (if you don't have a backup, you're defining the data as of 
less value than the time/trouble/resources to create the backup, thus of 
relatively low/trivial value in the first place, because you're more 
willing to risk losing it than to spend the time/resources/hassle to 
ensure against that risk).  Given all that, the better chance of an 
updated userspace being able to fix problems with less risk of further 
damage really does justify considering an update to reasonably current 
userspace.  If there's any doubt, stay a version or two behind the latest 
release and watch for reports of problems with it, but certainly, with 
4.15 userspace out and no serious reports of new damage from 4.14 
userspace, the latter should now be a reasonably safe upgrade.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Raid1 volume stuck as read-only: How to dump, recreate and restore its content?

2018-03-12 Thread Duncan
nd am able to sleep much more comfortably now as I'm not worrying 
about that backup I put off and the chance fate might take me up on my 
formerly too-high-for-comfort "trivial" threshold definition.=:^)

(And as it happens, I'm actually running from a system/root filesystem 
backup ATM, as an upgrade didn't go well and x wouldn't start, so I 
reverted.  But my root/system filesystem is under 10 gigs, on SSD for the 
backup as well as the working copy, so a full backup copy of root takes 
only a few minutes and I made one before upgrading a few packages I had 
some doubts about due to previous upgrade issues with them, so the delta 
between working and that backup was literally the five package upgrades 
I was, as it turned out, rightly worried about.  So that investment in 
ssds for 
backup has paid off.  While in this particular case simply taking a 
snapshot and recovering to it when the upgrade went bad would have worked 
just as well, having the independent filesystem backup on a different set 
of physical devices means I don't have to worry about loss of the 
filesystem or physical devices containing it, either! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: How to replace a failed drive in btrfs RAID 1 filesystem

2018-03-10 Thread Duncan
Andrei Borzenkov posted on Sat, 10 Mar 2018 13:27:03 +0300 as excerpted:


> And "missing" is not the answer because I obviously may have more than
> one missing device.

"missing" is indeed the answer when using btrfs device remove.  See the 
btrfs-device manpage, which explains that if there's more than one device 
missing, either just the first one described by the metadata will be 
removed (if missing is only specified once), or missing can be specified 
multiple times.

raid6 with two devices missing is the only normal candidate for that 
presently, tho on-list we've seen aborted-add cases where it still worked 
as well, because while the metadata listed the new device it didn't 
actually have any data when it became apparent it was bad and thus needed 
to be removed again.
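
(So concretely, with a generic mountpoint, the simple case and the raid6 
two-missing case described above look like:

  # one device missing, filesystem mounted degraded
  btrfs device remove missing /mnt

  # raid6 with two devices missing: specify missing twice
  btrfs device remove missing missing /mnt
)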

Note that because btrfs raid1 and raid10 only do two-way-mirroring 
regardless of the number of devices, and because of the per-chunk (as 
opposed to per-device) nature of btrfs raid10, those modes can only 
expect successful recovery with a single missing device, altho as 
mentioned above we've seen on-list at least one case where an aborted 
device-add of a device found to be bad after the add didn't actually have 
anything on it, so it could still be removed along with the device it was 
originally intended to replace.

Of course the N-way-mirroring mode, whenever it eventually gets 
implemented, will allow missing devices up to N-1, and N-way-parity mode, 
if it's ever implemented, similar, but N-way-mirroring was scheduled for 
after raid56 mode so it could make use of some of the same code, and that 
has of course taken years on years to get merged and stabilize, and 
there's no sign yet of N-way-mirroring patches, which based on the raid56 
case could take years to stabilize and debug after original merge, so the 
still somewhat iffy raid6 mode is likely to remain the only normal usage 
of multiple missing for years, yet.

For btrfs replace, the manpage says ID's the only way to handle missing, 
but getting that ID, as you've indicated, could be difficult.  For 
filesystems with only a few devices that haven't had any or many device 
config changes, it should be pretty easy to guess (a two device 
filesystem with no changes should have IDs 1 and 2, so if only one is 
listed, the other is obvious, and a 3-4 device fs with only one or two 
previous device changes, likely well remembered by the admin, should 
still be reasonably easy to guess), but as the number of devices and the 
number of device adds/removes/replaces increases, finding/guessing the 
missing one becomes far more difficult.
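
(To illustrate, with a hypothetical devid of 2 and placeholder device 
names:

  # list the devices the filesystem still knows about; the missing ID is
  # the one that isn't shown
  btrfs filesystem show /mnt

  # replace the missing device, by ID, with the new one, then watch it
  btrfs replace start 2 /dev/sdX /mnt
  btrfs replace status /mnt
)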

Of course the sysadmin's first rule of backups states, in simple form, 
that not having one == defining the value of the data as trivial, not 
worth the trouble of a backup.  That in turn means that at some point, 
before there's /too/ many device change events, it's likely going to be 
less trouble (particularly after factoring in reliability) to restore 
from backups to a fresh filesystem than it is to do yet another device 
change.  Together with the current practical limits btrfs imposes on the 
number of missing devices, that tends to impose /some/ limit on the 
possibilities for missing device IDs, so the situation, while not ideal, 
isn't yet /entirely/ out of hand either, because a successful guess based 
on available information should be possible without /too/ many attempts.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: spurious full btrfs corruption

2018-03-06 Thread Duncan
Christoph Anton Mitterer posted on Tue, 06 Mar 2018 01:57:58 +0100 as
excerpted:

> In the meantime I had a look of the remaining files that I got from the
> btrfs-restore (haven't run it again so far, from the OLD notebook, so
> only the results from the NEW notebook here:):
> 
> The remaining ones were multi-GB qcow2 images for some qemu VMs.
> I think I had none of these files open (i.e. VMs running) while in the
> final corruption phase... but at least I'm sure that not *all* of them
> were running.
> 
> However, all the qcow2 files from the restore are more or less garbage.
> During the btrfs-restore it already complained on them, that it would
> loop too often on them and whether I want to continue or not (I choose n
> and on another full run I choose y).
> 
> Some still contain a partition table, some partitions even filesystems
> (btrfs again)... but I cannot mount them.

Just a note on format choices FWIW, nothing at all to do with your 
current problem...

As my own use-case doesn't involve VMs I'm /far/ from an expert here, but 
if I'm screwing things up I'm sure someone will correct me and I'll learn 
something too, but it does /sound/ reasonable, so assuming I'm 
remembering correctly from a discussion here...

Tip: Btrfs and qcow2 are both copy-on-write/COW (it's in the qcow2 name, 
after all), and doing multiple layers of COW is both inefficient and a 
good candidate to test for corner-case bugs that wouldn't show up in 
more normal use-cases.  Assuming bug-free it /should/ work properly, of 
course, but equally of course, bug-free isn't an entirely realistic 
assumption. =8^0

... And you're putting btrfs on qcow2 on btrfs... THREE layers of COW!

The recommendation was thus to pick what layer you wish to COW at, and 
use something that's not COW-based at the other layers.  Apparently, qemu 
has raw-format as a choice as well as qcow2, and that was recommended as 
preferred for use with btrfs (and IIRC what the recommender was using 
himself).
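
(For illustration only, with a hypothetical image name and size, creating 
a raw image or converting an existing qcow2 would be something like:

  qemu-img create -f raw vm-disk.img 40G
  qemu-img convert -f qcow2 -O raw old-disk.qcow2 vm-disk.img
)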

But of course that still leaves cow-based btrfs on both the top and the 
bottom layers.  I suppose which of those is best left btrfs, while making 
the other, say, ext4 as the most widely used and hopefully safest 
general-purpose non-COW alternative, depends on the use-case.

Of course keeping btrfs at both levels but nocowing the image files on 
the host btrfs is a possibility as well, but nocow on btrfs has enough 
limits and caveats that I consider it a second-class "really should have 
used a different filesystem for this but didn't want to bother setting up 
a dedicated one" choice, and as such, don't consider it a viable option 
here.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs space used issue

2018-03-02 Thread Duncan
vinayak hegde posted on Thu, 01 Mar 2018 14:56:46 +0530 as excerpted:

> This will happen over and over again until we have completely
> overwritten the original extent, at which point your space usage will go
> back down to ~302g. We split big extents with cow, so unless you've got
> lots of space to spare or are going to use nodatacow you should probably
> not pre-allocate virt images

Indeed.  Preallocation with COW doesn't make the sense it does on an 
overwrite-in-place filesystem.  Either nocow it and take the penalties 
that brings[1], or configure your app not to preallocate in the first 
place[2].

---
[1] On btrfs, nocow implies no checksumming or transparent compression, 
either.  Also, the nocow attribute needs to be set on the empty file, 
with the easiest way to do that being to set it on the parent directory 
before file creation, so it's inherited by any newly created files/
subdirs within it.

[2] Many apps that preallocate by default have an option to turn 
preallocation off.
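
(To illustrate footnote [1], with placeholder paths:

  # set nocow on the empty parent directory so new files inherit it
  chattr +C /path/to/vm-images
  # verify: the C attribute should show up
  lsattr -d /path/to/vm-images
)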

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs space used issue

2018-02-28 Thread Duncan
Austin S. Hemmelgarn posted on Wed, 28 Feb 2018 14:24:40 -0500 as
excerpted:

>> I believe this effect is what Austin was referencing when he suggested
>> the defrag, tho defrag won't necessarily /entirely/ clear it up.  One
>> way to be /sure/ it's cleared up would be to rewrite the entire file,
>> deleting the original, either by copying it to a different filesystem
>> and back (with the off-filesystem copy guaranteeing that it can't use
>> reflinks to the existing extents), or by using cp's --reflink=never
>> option.
>> (FWIW, I prefer the former, just to be sure, using temporary copies to
>> a suitably sized tmpfs for speed where possible, tho obviously if the
>> file is larger than your memory size that's not possible.)

> Correct, this is why I recommended trying a defrag.  I've actually never
> seen things so bad that a simple defrag didn't fix them however (though
> I have seen a few cases where the target extent size had to be set
> higher than the default of 20MB).

Good to know.  I knew larger target extent sizes could help, but between 
not being sure they'd entirely fix it and not wanting to get too far down 
into the detail when the copy-off-the-filesystem-and-back option is 
/sure/ to fix the problem, I decided to handwave that part of it. =:^)
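
(For reference, a recursive defrag with a larger target extent size looks 
something like this, with the path a placeholder; note that it rewrites 
extents, so snapshotted/reflinked copies stop sharing them:

  btrfs filesystem defragment -r -t 128M /path/to/files
)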

> Also, as counter-intuitive as it
> might sound, autodefrag really doesn't help much with this, and can
> actually make things worse.

I hadn't actually seen that here, but suspect I might, now, as previous 
autodefrag behavior on my system tended to rewrite the entire file[1], 
thereby effectively giving me the benefit of the copy-away-and-back 
technique without actually bothering, while that "bug" has now been fixed.

I sort of wish the old behavior remained an option, maybe 
radicalautodefrag or something, and must confess to being a bit concerned 
over the eventual impact here now that autodefrag does /not/ rewrite the 
entire file any more, but oh, well...  Chances are it's not going to be 
/that/ big a deal since I /am/ on fast ssd, and if it becomes one, I 
guess I can just set up, say, firefox-profile-defrag.timer jobs or whatever, 
as necessary.

---
[1] I forgot whether it was ssd behavior, or compression, or what, but 
something I'm using here apparently forced autodefrag to rewrite the 
entire file, and a recent "bugfix" changed that so it's more in line with 
the normal autodefrag behavior.  I rather preferred the old behavior, 
especially since I'm on fast ssd and all my large files tend to be write-
once no-rewrite anyway, but I understand the performance implications on 
large active-rewrite files such as gig-plus database and VM-image files, 
so...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs space used issue

2018-02-28 Thread Duncan
mer, just to be sure, using temporary copies to a 
suitably sized tmpfs for speed where possible, tho obviously if the file 
is larger than your memory size that's not possible.)

Of course where applicable, snapshots and dedup keep reflink-references 
to the old extents, so they must be adjusted or deleted as well, to 
properly free that space.

---
[1] du: Because its purpose is different.  du's primary purpose is 
telling you in detail what space files take up, per-file and per-
directory, without particular regard to usage on the filesystem itself.  
df's focus, by contrast, is on the filesystem as a whole.  So where two 
files share the same extent due to reflinking, du should and does count 
that usage for each file, because that's what each file /uses/ even if 
they both use the same extents.


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Ongoing Btrfs stability issues

2018-02-16 Thread Duncan
Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
excerpted:

> This will probably sound like an odd question, but does BTRFS think your
> storage devices are SSD's or not?  Based on what you're saying, it
> sounds like you're running into issues resulting from the
> over-aggressive SSD 'optimizations' that were done by BTRFS until very
> recently.
> 
> You can verify if this is what's causing your problems or not by either
> upgrading to a recent mainline kernel version (I know the changes are in
> 4.15, I don't remember for certain if they're in 4.14 or not, but I
> think they are), or by adding 'nossd' to your mount options, and then
> seeing if you still have the problems or not (I suspect this is only
> part of it, and thus changing this will reduce the issues, but not
> completely eliminate them).  Make sure and run a full balance after
> changing either item, as the aforementioned 'optimizations' have an
> impact on how data is organized on-disk (which is ultimately what causes
> the issues), so they will have a lingering effect if you don't balance
> everything.

According to the wiki, 4.14 does indeed have the ssd changes.

According to the bug, he's running 4.13.x on one server and 4.14.x on 
two.  So upgrading the one to 4.14.x should mean all will have that fix.

However, without a full balance it /will/ take some time to settle down 
(again, assuming btrfs was using ssd mode), so the lingering effect could 
still be creating problems on the 4.14 kernel servers for the moment.
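
(For anyone wanting to test that combination, the sequence would be 
roughly, with the mountpoint a placeholder:

  mount -o remount,nossd /mountpoint
  btrfs balance start --full-balance /mountpoint

--full-balance simply skips the warning and delay an unfiltered balance 
otherwise gets.)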

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: fatal database corruption with btrfs "out of space" with ~50 GB left

2018-02-14 Thread Duncan
Tomasz Chmielewski posted on Thu, 15 Feb 2018 16:02:59 +0900 as excerpted:

>> Not sure if the removal of 80G has anything to do with this, but this
>> seems that your metadata (along with data) is quite scattered.
>> 
>> It's really recommended to keep some unallocated device space, and one
>> of the method to do that is to use balance to free such scattered space
>> from data/metadata usage.
>> 
>> And that's why balance routine is recommened for btrfs.
> 
> The balance might work on that server - it's less than 0.5 TB SSD disks.
> 
> However, on multi-terabyte servers with terabytes of data on HDD disks,
> running balance is not realistic. We have some servers where balance was
> taking 2 months or so, and was not even 50% done. And the IO load the
> balance was adding was slowing the things down a lot.

Try a filtered balance.  Something along the lines of:

btrfs balance start -dusage=10 <mountpoint>

The -dusage number, a limit on the chunk usage percentage, can start 
small, even 0, and be increased as necessary, until btrfs fi usage 
reports data size (currently 411 GiB) closer to data usage (currently 
246.14 GiB), with the freed space returning to unallocated.

I'd shoot for reducing data size to under 300 GiB, thus returning over 
100 GiB to unallocated, while hopefully not requiring too high a -dusage 
percentage and thus too long a balance time.  You could get it down under 
250 gig size, but that would likely take a lot of rewriting for little 
additional gain, since with it under 300 gig size you should already have 
over 100 gig unallocated.

Balance time should be quite short for low percentages, with a big 
payback if there's quite a few chunks with little usage, because at 10%, 
the filesystem can get rid of 10 chunks while only rewriting the 
equivalent of a single full chunk.

Obviously as the chunk usage percentage goes up, the payback goes down, 
so at 50%, it can only clear two chunks while writing one, and at 66%, it 
has to write two chunks worth to clear three.  Above that (tho I tend to 
round up to 70% here) is seldom worth it until the filesystem gets quite 
full and you're really fighting to keep a few gigs of unallocated space.  
(As Qu indicated, you always want at least a gig of unallocated space, on 
at least two devices if you're doing raid1.)

If you really wanted you could do the same with -musage for metadata, 
except that's not so bad, only 9 gig size, 3 gig used.  But you could 
free 5 gigs or so, if desired.
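
(Both filters can go in a single command if desired, raising the 
percentages as needed, for example:

  btrfs balance start -dusage=10 -musage=10 /mountpoint
)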


That's assuming there's no problem.  I see a followup indicating you're 
seeing problems in dmesg with a balance, however, and will let others 
deal with that.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Status of FST and mount times

2018-02-14 Thread Duncan
Qu Wenruo posted on Thu, 15 Feb 2018 09:42:27 +0800 as excerpted:

> The easiest way to get a basic idea of how large your extent tree is
> using debug tree:
> 
> # btrfs-debug-tree -r -t extent <device>
> 
> You would get something like:
> btrfs-progs v4.15 extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776
> level 0  <<<
> total bytes 10737418240 bytes used 393216 uuid
> 651fcf0c-0ffd-4351-9721-84b1615f02e0
> 
> That level is would give you some basic idea of the size of your extent
> tree.
> 
> For level 0, it could contain about 400 items on average.
> For level 1, it could contain up to 197K items.
> ...
> For level n, it could contain up to 400 * 493 ^ (n - 1) items.
> ( n <= 7 )

So for level 2 (which I see on a couple of mine here, ran it out of 
curiosity):

400 * 493 ^ (2 - 1) = 400 * 493 = 197200

197K for both level 1 and level 2?  Doesn't look correct.

Perhaps you meant a simple power of n, instead of (n-1)?  That would 
yield ~97M for level 2, and would yield the given numbers for levels 0 
and 1 as well, whereby using n-1 for level 0 yields less than a single 
entry, and 400 for level 1.

Or the given numbers were for level 1 and 2, with level 0 not holding 
anything, not levels 0 and 1.  But that wouldn't jibe with your level 0 
example, which I would assume could never happen if it couldn't hold even 
a single entry.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: fatal database corruption with btrfs "out of space" with ~50 GB left

2018-02-14 Thread Duncan
Tomasz Chmielewski posted on Wed, 14 Feb 2018 23:19:20 +0900 as excerpted:

> Just FYI, how dangerous running btrfs can be - we had a fatal,
> unrecoverable MySQL corruption when btrfs decided to do one of these "I
> have ~50 GB left, so let's do out of space (and corrupt some files at
> the same time, ha ha!)".

Ouch!

> Running btrfs RAID-1 with kernel 4.14.

Kernel 4.14... quite current... good.  But 4.14.0 first release, 4.14.x 
current stable, or somewhere (where?) in between?

And please post the output of btrfs fi usage for that filesystem.  
Without that (or fi sh and fi df, the pre-usage method of getting nearly 
the same info), it's hard to say where or what the problem was.

Meanwhile, FWIW there was a recent metadata over-reserve bug that should 
be fixed in 4.15 and the latest 4.14 stable, but IDR whether it affected 
4.14.0 original or only the 4.13 series and early 4.14-rcs and was fixed 
by 4.14.0.  The bug seemed to trigger most frequently when doing balances 
or other major writes to the filesystem, on middle to large sized 
filesystems.  (My all under quarter-TB each btrfs didn't appear to be 
affected.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Status of FST and mount times

2018-02-14 Thread Duncan
there's nothing in your post indicating it's valid as 
/your/ case.

Of course the other possibility is live-failover, which is sure to be 
facebook's use-case.  But with live-failover, the viability of btrfs 
check --repair more or less ceases to be of interest, because the failover 
happens (relative to the offline check or restore time) instantly, and 
once the failed devices/machine is taken out of service it's far more 
effective to simply blow away the filesystem (if not replacing the 
device(s) entirely) and restore "at leisure" from backup, a relatively 
guaranteed procedure compared to the "no guarantees" of attempting to 
check --repair the filesystem out of trouble.

Which is very likely why the free-space-tree still isn't well supported 
by btrfs-progs, including btrfs check, several kernel (and thus -progs) 
development cycles later.  The people who really need the one (whichever 
one of the two)... don't tend to (or at least /shouldn't/) make use of 
the other so much.

It's also worth mentioning that btrfs raid0 mode, as well as single mode, 
hobbles the btrfs data and metadata integrity feature: checksums are 
still generated, stored and checked by default, so integrity problems can 
still be detected, but because raid0 (and single) includes no redundancy, 
there's no second copy (raid1/10) or parity redundancy (raid5/6) to 
rebuild the bad data from, so it's simply gone.  
(Well, for data you can try btrfs restore of the otherwise inaccessible 
file and hope for the best, and for metadata, you can try check --repair 
and again hope for the best, but...)  If you're using that feature of 
btrfs and want/need more than just detection of a problem that can't be 
fixed due to lack of redundancy, there's a good chance you want a real 
redundancy raid mode on multi-device, or dup mode on single device.
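
(For instance, getting an existing multi-device filesystem off raid0 is 
just a filtered balance, with the mountpoint a placeholder:

  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mountpoint
)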

So bottom line... given the sacrificial lack of redundancy and 
reliability of raid0, btrfs or not, in an enterprise setting with tens of 
TB of data, why are you worrying about the viability of btrfs check --
repair on what the placement on raid0 decrees to be throw-away data 
anyway?  At first glance, one of the two must be wrong: either the raid0 
mode, which declares those tens of TB of data to be of throw-away value, 
or your concern about the viability of btrfs check --repair, which 
suggests you don't consider that data throw-away after all.  Which one is 
wrong is your 
call, and there's certainly individual cases (one of which I even named) 
where concern about the viability of btrfs check --repair on raid0 might 
be valid, but your post has no real indication that your case is such a 
case, and honestly, that worries me!

> 2. There's another thread on-going about mount delays.  I've been
> completely blind to this specific problem until it caught my eye.  Does
> anyone have ballpark estimates for how long very large HDD-based
> filesystems will take to mount?  Yes, I know it will depend on the
> dataset.  I'm looking for O() worst-case approximations for
> enterprise-grade large drives (12/14TB), as I expect it should scale
> with multiple drives so approximating for a single drive should be good
> enough.

No input on that question here (my own use-case couldn't be more 
different, multiple small sub-half-TB independent btrfs raid1s on 
partitioned ssds), but another concern, based on real-world reports I've 
seen on-list:

12-14 TB individual drives?

While you /did/ say enterprise grade so this probably doesn't apply to 
you, it might apply to others that will read this.

Be careful that you're not trying to use the "archive application" 
targeted SMR drives for general purpose use.  Occasionally people will 
try to buy and use such drives in general purpose use due to their 
cheaper per-TB cost, and it just doesn't go well.  We've had a number of 
reports of that. =:^(

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs - kernel warning

2018-02-04 Thread Duncan
Duncan posted on Fri, 02 Feb 2018 02:49:52 + as excerpted:

> As CMurphy says, 4.11-ish is starting to be reasonable.  But you're on
> the LTS kernel 4.14 series and userspace 4.14 was developed in parallel,
> so btrfs-progs-3.14 would be ideal.

Umm... obviously that should be 4.14.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs - kernel warning

2018-02-01 Thread Duncan
 data since the last backup becomes more valuable than 
the time/trouble/resources necessary to update your backup, you will do 
so.  If you haven't, it simply means you're defining the changes since 
your last backup as of less value than the time/trouble/resources 
necessary to do that update, so again, you can *always* rest easy in the 
face of filesystem or device problems, because you either have it backed 
up, or by definition of /not/ having it backed up, it was self-evidently 
not worth the trouble to do so yet, so you saved what was most important 
to you either way.

So think about your value definitions regarding your data and change them 
if you need to... while you still have the chance. =:^)

(And the implications of the above change how you deal with a broken 
filesystem too.  With either current backups or what you've literally 
defined as throw-away data due to it not being worth the trouble of 
backups, it makes little sense to spend more than a trivial amount of 
time trying to recover data from a messed up filesystem, especially given 
that there's no guarantee you'll get it all back undamaged even if you 
/do/ spend the time.  It's often simpler, faster, and more certain of 
success, to simply blow away the defective filesystem with a 
fresh mkfs and restore the data from backups, since that way you know 
you'll have a fresh filesystem and known-good data from the backup, as 
opposed to no guarantees /what/ you'll end up with trying to recover/
repair the old filesystem.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: degraded permanent mount option

2018-01-28 Thread Duncan
Andrei Borzenkov posted on Sun, 28 Jan 2018 11:06:06 +0300 as excerpted:

> 27.01.2018 18:22, Duncan wrote:
>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>> 
>>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>>>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>>>
>>>>>> I just tested to boot with a single drive (raid1 degraded), even
>>>>>> with degraded option in fstab and grub, unable to boot !  The boot
>>>>>> process stop on initramfs.
>>>>>>
>>>>>> Is there a solution to boot with systemd and degraded array ?
>>>>>
>>>>> No. It is finger pointing. Both btrfs and systemd developers say
>>>>> everything is fine from their point of view.
>>>
>>> It's quite obvious who's the culprit: every single remaining rc system
>>> manages to mount degraded btrfs without problems.  They just don't try
>>> to outsmart the kernel.
>> 
>> No kidding.
>> 
>> All systemd has to do is leave the mount alone that the kernel has
>> already done,
> 
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

As Tomaz indicates, I'm talking about manual mounting (after the initr* 
drops to a maintenance prompt if it's root being mounted, or on manual 
mount later if it's an optional mount) here.  The kernel accepts the 
degraded mount and it's mounted for a fraction of a second, but systemd 
actually undoes the successful work of the kernel to mount it, so by the 
time the prompt returns and a user can check, the filesystem is unmounted 
again, with the only indication that it was mounted at all being the log.

He says that's because the kernel still says it's not ready, but that's 
for /normal/ mounting.  The kernel accepted the degraded mount and 
actually mounted the filesystem, but systemd undoes that.
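
(The manual mount in question being nothing more than, with placeholder 
names:

  mount -o degraded /dev/sdX /mnt
)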

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: degraded permanent mount option

2018-01-27 Thread Duncan
Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:

> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>> 
>> >> I just tested to boot with a single drive (raid1 degraded), even
>> >> with degraded option in fstab and grub, unable to boot !  The boot
>> >> process stop on initramfs.
>> >> 
>> >> Is there a solution to boot with systemd and degraded array ?
>> > 
>> > No. It is finger pointing. Both btrfs and systemd developers say
>> > everything is fine from their point of view.
> 
> It's quite obvious who's the culprit: every single remaining rc system
> manages to mount degraded btrfs without problems.  They just don't try
> to outsmart the kernel.

No kidding.

All systemd has to do is leave the mount alone that the kernel has 
already done, instead of insisting it knows what's going on better than 
the kernel does, and immediately umounting it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: bad key ordering - repairable?

2018-01-24 Thread Duncan
ion of 
ssds, a pair of 1 TB samsung evos, but this reminds me that at nearing 
six years old the main system's aging too, so I better start thinking of 
replacing it again...)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Periodic frame losses when recording to btrfs volume with OBS

2018-01-23 Thread Duncan
ein posted on Tue, 23 Jan 2018 09:38:13 +0100 as excerpted:

> On 01/22/2018 09:59 AM, Duncan wrote:
>> 
>> And to tie up a loose end, xfs has somewhat different design principles
>> and may well not be particularly sensitive to the dirty_* settings,
>> while btrfs, due to COW and other design choices, is likely more
>> sensitive to them than the widely used ext* and reiserfs (my old choice
>> and the basis of my own settings, above).

> Excellent booklike writeup showing how /proc/sys/vm/ works, but I
> wonder, how can you explain why does XFS work in this case?

I can't, directly, which is why I glossed over it so fast above.  I do 
have some "educated guesswork", but that's _all_ it is, as I've not had 
reason to get particularly familiar with xfs and its quirks.  You'd have 
to ask the xfs folks if my _guess_ is anything approaching reality, but 
if you do please be clear that I explicitly said I don't know and that 
this is simply my best guess based on the very limited exposure to xfs 
discussions I've had.

So I'm not experience-familiar with xfs and other than what I've happened 
across in cross-list threads here, know little about it except that it 
was ported to Linux from other *ix.  I understand the xfs port to 
"native" is far more complete than that of zfs, for example.  
Additionally, I know from various vfs discussion threads cross-posted to 
this and other filesystem lists that xfs remains rather different than 
some -- apparently (if I've gotten it right) it handles "objects" rather 
than inodes and extents, for instance.

Apparently, if the vfs threads I've read are to be believed, xfs would 
have some trouble with a proposed vfs interface that would allow requests 
to write out and free N pages or N KiB of dirty RAM from the write 
buffers in ordered to clear memory for other usage, because it tracks 
objects rather than dirty pages/KiB of RAM.  Sure it could do it, but it 
wouldn't be an efficient enough operation to be worth the trouble for 
xfs.  So apparently xfs just won't make use of that feature of the 
proposed new vfs API, there's nothing that says it /has/ to, after all -- 
it's proposed to be optional, not mandatory.

Now that discussion was in a somewhat different context than the 
vm.dirty_* settings discussion here, but it seems reasonable to assume 
that if xfs would have trouble converting objects to the size of the 
memory they take in the one case, the /proc/sys/vm/dirty_* dirty writeback 
cache tweaking features may not apply to xfs, at least in a direct/
intuitive way, either.


Which is why I suggested xfs might not be particularly sensitive to those 
settings -- I don't know that it ignores them entirely, and it may use 
them in /some/ way, possibly indirectly, but the evidence I've seen does 
suggest that xfs may, if it uses those settings at all, not be as 
sensitive to them as btrfs/reiserfs/ext*.

Meanwhile, due to the extra work btrfs does with checksumming and cow, 
while AFAIK it uses the settings "straight", having them out of whack 
likely has a stronger effect on btrfs than it does on ext* and reiserfs 
(with reiserfs likely being slightly more strongly affected than ext*, 
but not to the level of btrfs).

And there has indeed been confirmation on-list that adjusting these 
settings *does* have a very favorable effect on btrfs for /some/ use-
cases.

(In one particular case, the posting was to the main LKML, but on btrfs 
IIRC, and Linus got involved.  I don't believe that lead to the 
/creation/ of the relatively new per-device throttling stuff as I believe 
the patches were already around, but I suspect it may have lead to their 
integration in mainline a few kernel cycles earlier than they may have 
been otherwise.  Because it's a reasonably well known "secret" that the 
default ratios are out of whack on modern systems, it's just not settled 
what the new defaults /should/ be, so in the absence of agreement or 
pressing problem, they remain as they are.  But Linus blew his top as 
he's known to do, he and others pointed the reporter at the vm.dirty_* 
settings tho Linus wanted to know why the defaults were so insane for 
today's machines, and tweaking those did indeed help.  Then a kernel 
cycle or two later the throttling options appeared in mainline, very 
possibly as a result of Linus "routing around the problem" to some 
extent.)
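
(For anyone wanting to experiment, the knobs in question are the 
vm.dirty_* sysctls; the byte values below are purely illustrative, not 
recommendations:

  # inspect the current ratio-based defaults
  sysctl vm.dirty_ratio vm.dirty_background_ratio

  # switch to byte-based limits sized for the actual storage; setting the
  # *_bytes knobs automatically zeroes the corresponding *_ratio ones
  sysctl -w vm.dirty_background_bytes=268435456
  sysctl -w vm.dirty_bytes=1073741824
)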


So in my head I have a picture of the possible continuum of vm.dirty_ 
effect that looks like this:

<- weak effect                                              strong ->

zfs ..... xfs ..... ext* ..... reiserfs ..... btrfs

zfs, no or almost no effect, because it uses a non-native mechanism and is 
poorly adapted to Linux.

xfs, possibly some effect, but likely relatively light, because its 
mechanisms aren't completely adapted to Linux-vfs-native either, and if 
it uses those settings at all, it's likely only indirectly.

Re: Periodic frame losses when recording to btrfs volume with OBS

2018-01-22 Thread Duncan
e kyber 
and bfq schedulers, as well -- and setting IO priority -- probably by 
increasing the IO priority of the streaming app.  The tool to use for the 
latter is called ionice.  Do note, however, that not all schedulers 
implement IO priorities.  CFQ does, but while I think deadline should 
work better for the streaming use-case, it's simpler code and I don't 
believe it implements IO priority.  Similarly for multi-queue, I'd guess 
the low-code-designed-for-fast-direct-PCIE-connected-SSD kyber doesn't 
implement IO priorities, while the more complex and general purpose 
suitable-for-spinning-rust bfq /might/ implement IO priorities.
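
(A concrete example, with a hypothetical PID:

  # best-effort class, highest priority within it, for the streaming app
  ionice -c 2 -n 0 -p 12345
)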

But I know less about that stuff and it's googlable, should you decide to 
try playing with it too.  I know what the dirty_* stuff does from 
personal experience. =:^)


And to tie up a loose end, xfs has somewhat different design principles 
and may well not be particularly sensitive to the dirty_* settings, while 
btrfs, due to COW and other design choices, is likely more sensitive to 
them than the widely used ext* and reiserfs (my old choice and the basis 
of my own settings, above).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs volume corrupt. btrfs-progs bug or need to rebuild volume?

2018-01-19 Thread Duncan
Rosen Penev posted on Fri, 19 Jan 2018 13:45:35 -0800 as excerpted:

> v2: Add proper subject

=:^)

> I've been playing around with a specific kernel on a specific device
> trying to figure out why btrfs keeps throwing csum errors after ~15
> hours. I've almost nailed it down to some specific CONFIG option in the
> kernel, possibly related to IRQs.
> 
> Anyway, I managed to get my btrfs RAID5 array corrupted to the point
> where it will just mount to read-only mode.

[...]

> This is with version 4.14 of btrfs-progs. Do I need a newer version or
> should I just reinitialize my array and copy everything back?
> 
> Log on mount attached below:

[...]

> Fri Jan 19 14:26:08 2018 kern.warn kernel:
> [168383.378239] CPU: 0 PID:
> 2496 Comm: kworker/u8:2 Tainted: GW   4.9.75 #0

Tho as the penultimate LTS kernel series 4.9 is still on the btrfs-list 
supported list in general... 4.9 still had known btrfs raid56 mode issues 
and is strongly negatively recommended for use with btrfs raid56 mode.  
Those weren't fixed until 4.12, which /finally/ brought raid56 mode into 
generally working and not negatively recommended state.

While as an LTS, applicable general btrfs bug fixes would be backported 
to 4.9, raid56 mode had never worked /well/ at that point, so I'm not 
sure the raid56 fixes in particular were backported.

So you really need either kernel 4.12+, presumably the LTS 4.14 series 
since you're on LTS 4.9 series now, for btrfs raid56 mode, or don't use 
raid56 mode if you plan on staying with the 4.9 LTS, as it still had 
severe known issues back then and I haven't seen on-list confirmation 
that the 4.12 btrfs raid56 mode fixes were backported to 4.9-LTS.  

If you need/choose to stick with 4.9 and dump raid56 mode, the 
recommended alternative depends on the number of devices in the 
filesystem.

For a small number of devices in the filesystem, btrfs raid1 is 
effectively as stable as the still stabilizing and maturing btrfs itself 
is at this point and is recommended.

For a larger number of devices, btrfs raid1 is still a good choice 
because it /is/ the most mature, but btrfs raid10 is /reasonably/ stable 
tho IMO not quite as stable as raid1, or for better performance (due to 
btrfs raid10 not being read-optimized yet) while keeping btrfs 
checksumming and error repair from the second copy when available, 
consider a layered approach, with btrfs raid1 on top of a pair of mdraid0s 
(or dmraid0s, or hardware raid0s).
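
(A sketch of that layered setup, with device names hypothetical and mdadm 
providing the raid0 legs:

  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
  mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
)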

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: big volumes only work reliable with ssd_spread

2018-01-15 Thread Duncan
Stefan Priebe - Profihost AG posted on Mon, 15 Jan 2018 10:55:42 +0100 as
excerpted:

> since around two or three years i'm using btrfs for incremental VM
> backups.
> 
> some data:
> - volume size 60TB
> - around 2000 subvolumes
> - each differential backup stacks on top of a subvolume
> - compress-force=zstd
> - space_cache=v2
> - no quota / qgroup
> 
> this works fine since Kernel 4.14 except that i need ssd_spread as an
> option. If i do not use ssd_spread i always end up with very slow
> performance and a single kworker process using 100% CPU after some days.
> 
> With ssd_spread those boxes run fine since around 6 month. Is this
> something expected? I haven't found any hint regarding such an impact.

My understanding of the technical details is "limited" as I'm not a dev, 
and I expect you'll get a more technically accurate response later, but 
sometimes a first not particularly technical response can be helpful as 
long as it's not /wrong/.  (And if it is, this is a good way to have my 
understanding corrected as well. =:^)  With that caveat, based on my 
understanding of what I've seen on-list...

The kernel v4.14 ssd mount-option changes apparently primarily affected 
data, not metadata.  Apparently, ssd_spread has a heavier metadata 
effect, and the v4.14 changes moved additional (I believe metadata) 
functionality to ssd_spread that had originally been part of ssd as 
well.  There has been some discussion of similar metadata tweaks for the 
plain ssd option, but they weren't deemed as demonstrably needed as the 
4.14 ssd tweaks and needed further discussion, so they were put off 
until the effect of the 4.14 changes could be gauged in more widespread 
use, after which they were to be reconsidered if necessary.

Meanwhile, in the discussion I saw, Chris Mason mentioned that Facebook 
is using ssd_spread for various reasons there, so it's well-tested with 
their deployments, which I'd assume have many of the same qualities yours 
do, thus implying that your observations about ssd_spread are no accident.

In fact, if I interpreted Chris's comments correctly, they use ssd_spread 
on very large multi-layered non-ssd storage arrays, in part because the 
larger layout-alignment optimizations make sense there as well as on 
ssds.  That would appear to be precisely what you are seeing. =:^)  If 
that's the case, then arguably the option is misnamed and the ssd_spread 
name may well at some point be deprecated in favor of something more 
descriptive of its actual function and target devices.  Purely my own 
speculation here, but perhaps something like vla_spread (very-large-
array)?
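
For reference, and purely as an illustration of the options under 
discussion rather than a recommendation, an fstab line mirroring the 
reported setup might look like this (device and mount point are 
placeholders):

  /dev/sdX  /backup  btrfs  compress-force=zstd,space_cache=v2,ssd_spread  0 0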

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Hanging after frequent use of systemd-nspawn --ephemeral

2018-01-14 Thread Duncan
Qu Wenruo posted on Sun, 14 Jan 2018 10:27:40 +0800 as excerpted:

> Despite of that, did that really hangs?
> Qgroup dramatically increase overhead to delete a subvolume or balance
> the fs.
> Maybe it's just a little slow?

Same question about the "hang" here.

Note that btrfs is optimized to make snapshot creation fast, while 
snapshot deletion has to do more work to clean things up.  So even 
without qgroup enabled, deletion can take a bit of time (much longer than 
creation, which should be nearly instantaneous in human terms) if there 
are a lot of reflinks and the like to clean up.

And qgroups make btrfs do much more work to track all that as well, so 
as Qu says, that'll make snapshot deletion take even longer, and you 
probably want qgroups disabled unless you actually need the feature for 
something you're doing.
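
As a concrete sketch (paths are examples), turning quotas off and then 
waiting for the cleaner to actually finish a deletion looks like:

  btrfs quota disable /mnt
  btrfs subvolume delete /mnt/snapshots/old-snap
  btrfs subvolume sync /mnt   # returns once deleted subvolumes are cleaned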

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Limit on the number of btrfs snapshots?

2018-01-14 Thread Duncan
Daniel E. Shub posted on Fri, 12 Jan 2018 16:38:30 -0500 as excerpted:

> A couple of years ago I asked a question on the Unix and Linux Stack
> Exchange about the limit on the number of BTRFS snapshots:
> https://unix.stackexchange.com/q/140360/22724
> 
> Basically, I want to use something like snapper to take time based
> snapshots so that I can browse old versions of my data. This would be in
> addition to my current off site backup since a drive failure would wipe
> out the data and the snapshots. Is there a limit to the number of
> snapshots I can take and store? If I have a million snapshots (e.g., a
> snapshot every minute for two years) would that cause havoc, assuming I
> have enough disk space for the data, the changed data, and the meta
> data?
> 
> The answers there provided a link to the wiki:
> https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Snapshots_and_Subvolumes
> that says: "snapshots are writable, and they can be snapshotted again
> any number of times."
> 
> While I don't doubt that that is technically true, another user
> suggested that the practical limit is around 100 snapshots.
> 
> While I am not convinced that having minute-by-minute versions of my
> data for two years is helpful (how the hell is anyone going to find the
> exact minute they are looking for), if there is no cost then I figure
> why not.
> 
> I guess I am asking is what is the story and where is it documented.

Very logical question. =:^)

The (practical) answer depends to some extent on how you use btrfs.

Btrfs does have scaling issues due to too many snapshots (or actually 
the reflinks snapshots use; dedup using reflinks can trigger the same 
scaling issues), and single to low double-digits of snapshots per 
snapshotted subvolume remains the strong recommendation for that reason.

But the scaling issues primarily affect the btrfs maintenance commands 
themselves: balance, check, subvolume delete.  While millions of 
snapshots will make balance, for example, effectively unworkable (it'll 
sort of work but could take months), normal filesystem operations like 
reading and saving files don't tend to be affected, except to the extent 
that fragmentation becomes an issue (tho cow filesystems such as btrfs 
are noted for fragmentation, unless steps like defrag are taken to 
reduce it).

So for home and SOHO type usage where you might for instance want to add 
a device to the filesystem and rebalance to make full use of it, and 
where when a filesystem won't mount you are likely to want to run btrfs 
check to try to fix it, a max of 100 or so snapshots per subvolume is 
indeed a strong recommendation.
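
As a quick sanity-check against that guideline, something like the 
following (mount point being an example) counts the snapshots on a 
filesystem, and snapshot managers such as snapper can keep the number 
bounded automatically via their cleanup/retention settings:

  btrfs subvolume list -s /mnt | wc -l   # -s lists only snapshots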

But in large business/corporate environments where there are hot-spare 
standbys to fail over to and three-way offsite backups of the hot-spare 
and onsite backups, it's not such a big issue, because rather than 
balancing or fscking, such usage generally just fails over to the 
backups and recycles the previous working filesystem devices, so a 
balance or a check taking three years isn't an issue because they don't 
tend to run those sorts of commands in the first place.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Recommendations for balancing as part of regular maintenance?

2018-01-11 Thread Duncan
Austin S. Hemmelgarn posted on Wed, 10 Jan 2018 12:01:42 -0500 as
excerpted:

>> - Some experienced users say that, to resolve a problem with DoUS, they
>> would rather recreate the filesystem than run balance.

> This is kind of independent of BTRFS.  A lot of seasoned system
> administrators are going to be more likely to just rebuild a broken
> filesystem from scratch if possible than repair it simply because it's
> more reliable and generally guaranteed to fix the issue.  It largely
> comes down to the mentality of the individual, and how confident they
> are that they can fix a problem in a reasonable amount of time without
> causing damage elsewhere.

Specific to this one...

I'm known around here for harping on the backup point (hold on, I'll 
explain how that ties in).  A/the sysadmin's first rule of backups: The 
(true) value of your data is defined not by any arbitrary claims, but by 
how many backups of that data you consider it worth having.  No backups 
defines the data as of only trivial value, worth less than the time/
trouble/resources necessary to make that backup.

It therefore follows that in the event of a data mishap, a sysadmin can 
always rest happy, because regardless of what might have been lost, 
whatever their actions defined as of *MOST* value, either the data if it 
was backed up, or the time/trouble/resources that would have otherwise 
gone into that backup if not, was *ALWAYS* saved.

Understanding that puts an entirely different spin on backups and data 
mishaps, taking a lot of the pressure off when things /do/ go wrong, 
because one understands that the /true/ value of that data was defined 
long before, and now we're simply dealing with the results of our 
decision to define it that way, only playing out the story we set up 
for ourselves long before.

But how does that apply to the current discussion?

Simply this way:  For someone understanding the above, repair is never a 
huge problem or priority, because the data was either of such trivial 
value as to make it no big deal, or there were backups, thus making this 
particular instance of the data, and the necessity of repair, no big deal.

Once /that/ is understood, the question of repair vs. rebuild from 
scratch (or even simply fail-over to the hot-spare and send the old 
filesystem component devices to be tested for reuse or recycle) becomes 
purely one of efficiency, and the answer ends up being pretty 
predictable, because rebuild from scratch and restore from backup should 
be near 100% reliable on a reasonable/predictable time frame, vs. 
/attempting/ a repair with unknown likelihood of success and a much 
/less/ predictable time frame, especially since there's a non-trivial 
chance one will have to fall back to the rebuild from scratch and backups 
method anyway, after repair attempts fail.


Once one is thinking in those terms and already has backups accordingly, 
even for home or other one-off systems where actual formatting and 
restore from backups is going to be manual and thus will take longer than 
a trivial fix, the practical limits on the extent to which one is 
willing to go to get a fix are pretty narrow.  While one might try a 
couple fixes if they're easy and quick enough, beyond that it very 
quickly becomes restore-from-backups time if the data was considered 
valuable enough to be worth making them, or simply throw it away and 
start over if the data wasn't considered valuable enough to be worth 
making a backup in the first place.


So it's really independent of btrfs and not reflective of the reliability 
of balance, etc, at all.  It's simply a reflection of understanding the 
realities of possible repair... or not and having to replace anyway... 
without a good estimate on the time required either way... vs. a (near) 
100% guaranteed fix and back in business, in a relatively tightly 
predictable timeframe.  Couple that with the possibility that a repair 
may leave other problems latent and ready to be exposed later, while 
starting over from scratch gives you a "clean starting point", and it's 
pretty much a no-brainer, regardless of the filesystem... or whatever 
else (hardware, software layers other than the filesystem) may be in use.


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Recommendations for balancing as part of regular maintenance?

2018-01-09 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 09 Jan 2018 07:46:48 -0500 as
excerpted:

>> On 08/01/18 23:29, Martin Raiber wrote:
>>> There have been reports of (rare) corruption caused by balance (won't
>>> be detected by a scrub) here on the mailing list. So I would stay a
>>> away from btrfs balance unless it is absolutely needed (ENOSPC), and
>>> while it is run I would try not to do anything else wrt. to writes
>>> simultaneously.
>> 
>> This is my opinion too as a normal user, based upon reading this list
>> and own attempts to recover from ENOSPC. I'd rather re-create
>> filesystem from scratch, or at least make full verified backup before
>> attempting to fix problems with balance.

> While I'm generally of the same opinion (and I have a feeling most other
> people who have been server admins are too), it's not a very user
> friendly position to recommend that.  Keep in mind that many (probably
> most) users don't keep proper backups, and just targeting 'sensible'
> people as your primary audience is a bad idea.  It also needs to work at
> at least a basic level anyway though simply because you can't always
> just nuke the volume and rebuild it from scratch.
> 
> Personally though, I don't think I've ever seen issues with balance
> corrupting data, and I don't recall seeing complaints about it either
> (though I would love to see some links that prove me wrong).

AFAIK, such corruption reports re balance aren't really balance, per se, 
at all.

Instead, what I've seen in nearly all cases is a number of filesystem 
maintenance commands involving heavy I/O colliding, that is, being run at 
the same time, possibly because some of them are scheduled, and the admin 
didn't take into account scheduled commands when issuing others manually.

I don't believe anyone would recommend running balance, scrub, snapshot-
deletion, and backups (rsync or btrfs send/receive being the common 
ones), all at the same time, or even two or more at the same time, if for 
no other reason than because they're all IO intensive and running just 
/one/ of them at a time is hard /enough/ on the system and the 
performance of anything else running at the same time, even when all 
components are fully stable and mature (and as we all know, btrfs is 
stabilizing, but not yet fully stable and mature), yet that's what these 
sorts of reports invariably involve.

Of course, with a certainty btrfs /should/ be able to handle more than 
one of these at once without corruption, because anything else is a bug, 
but... btrfs /is/ still stabilizing and maturing, and it's precisely 
these sorts of rare corner-case race-condition bugs, where more than one 
extremely heavy-IO filesystem maintenance command is being run at the 
same time, that tend to be the last to be found and fixed, because they 
/are/ rare corner-cases, often depending on race conditions, that tend 
to be rarely enough reported, and then extremely difficult to duplicate, 
so that's exactly the type of bug that tends to remain around at this 
point.


So rather than discouraging a sane-filtered regular balance (which I'll 
discuss in a different reply), I'd suggest that the more sane 
recommendation is to be aware of other major-IO filesystem maintenance 
commands (not just btrfs commands but rsync-based backups, etc, too, 
rsync being demanding enough on its own to have triggered a number of 
btrfs bug reports and fixes over the years), including scheduled 
commands, and to only run one at a time.

IOW, don't do a balance if your scheduled backup or snapshot-deletion is 
about to kick in.  One at a time is stressful enough on the filesystem 
and hardware, don't compound the problem trying to do two or more at once!

So assuming a weekly schedule, do one a day of balance, scrub, snapshot-
deletion, backups (after ensuring that none of them take over a day; 
balance in particular could take longer than that at TiB-scale+ if not 
sanely filtered, particularly if quotas are enabled, due to the scaling 
issues of that feature).  And if any of those are scheduled daily or 
more frequently, space the scheduling appropriately and ensure they're 
done before starting the next task.
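
As a rough illustration only (times, filters, paths and the pruning 
helper are hypothetical, adjust to taste), such a staggered week might 
look like this in a crontab:

  0 2 * * 1  btrfs balance start -dusage=50 -musage=50 /mnt/pool
  0 2 * * 3  btrfs scrub start -B /mnt/pool
  0 2 * * 5  /usr/local/bin/prune-snapshots.sh   # hypothetical helper
  0 2 * * 6  rsync -a --delete /mnt/pool/ /mnt/backup/

The point being one heavy-IO job per night, each verified to finish well 
before the next one starts.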

And keep in mind the scheduled tasks when running things manually, so as 
not to collide there either.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs scrub not repair corruption

2018-01-09 Thread Duncan
Wolf posted on Mon, 08 Jan 2018 23:27:27 +0100 as excerpted:

> I'm running btrfs scrub on my raid each week (is that too often?) and
> I'm having a problem that it reports corruption, says it's repaired but
> next week reports it again.

I won't attempt to answer the larger question, but on the narrow "too 
often?" question, no, running scrub once a week shouldn't be a problem.

Scrub is read-only unless it finds errors, so even running it repeatedly 
end-to-end shouldn't be a problem, other than the obvious performance 
issue and the potential increased head-seek wear on non-ssd devices.  The 
obvious issue would be slowing down whatever else you're doing at the 
same time, and at whatever presumably scheduled weekly time you run it 
that's evidently not a problem for your use-case.
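
For completeness, a scheduled scrub plus a later check of its results is 
typically just (mount point being an example):

  btrfs scrub start /mnt
  btrfs scrub status /mnt   # look for corrected/uncorrectable error counts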


Also, a bit OT as I don't believe it's related to this, but FWIW...

There *has* been a recent kernel issue with gentoo-hardened compiling 
kernel code incorrectly due to a gcc option enabled by default on 
hardened.  I don't remember the details, but I ran across it in one of 
the kernel development articles I read.  I /think/ it applied only to 
4.15-rc, however, or possibly 4.14.  The fix is to disable that specific 
gcc option when building the kernel, as it was designed for userspace and 
doesn't make much sense for the kernel anyway.  A patch doing just that 
should already be part of the latest 4.15-rcs and if the bug applied to 
4.14 it'll be backported there as well, but I'm not sure of current 4.14-
stable status.

(I run gentoo, so my interest perked when I came across the discussion, 
but not hardened, so I didn't need to retain the details.)

If you're not already aware of that, you might wish to research it a bit 
more, and disable whatever option manually in your kernel-build CFLAGS, 
tho as mentioned once the patch is applied the kernel make files 
automatically apply the appropriate option. (The official kernel CFLAGS 
related vars are KCFLAGS (C), KCPPFLAGS (pre-processor), and KAFLAGS 
(assembler).)  Unfortunately IDR what the specific flag was, 
-fno-something, IIRC.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs quota exceeded notifications through netlink sockets

2018-01-06 Thread Duncan
Karsai, Gabor posted on Sat, 06 Jan 2018 01:34:09 + as excerpted:

> I created a subvolume on a btrfs, set a limit and the quota is enforced
> - dumping too much data into the subvolume results in a 'quota exceeded'
> message (from dd, for example). But when I am trying to get netlink
> socket notifications, nothing arrives on the socket (I am using pyroute2
> which is supposedly able to receive disk quota notifications)
> 
> $ uname -a
> Linux riaps-dev 4.10.0-42-generic #46~16.04.1-Ubuntu SMP
> Mon Dec 4 15:57:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
> 
> btrfs: whatever Ubuntu 16.04 has
> 
> Kconfig:
> CONFIG_QUOTA_NETLINK_INTERFACE=y

Someone with a bit more knowledge of quotas in general and btrfs quotas 
in particular will hopefully confirm (I'm just a btrfs user and list 
regular, without quotas in my own use-case, so this is only what I've 
seen on-list), but sometimes a fast non-authoritative answer can be more 
useful than a slower authoritative answer, so here's the first...

Btrfs quotas don't use the normal kernel quota mechanism as they work 
somewhat differently.  Indeed, the kernel-config help for CONFIG_QUOTA 
doesn't mention btrfs support at all.  As such, I don't believe the btrfs 
quota subsystem uses the normal kernel quota netlink interface at all.  
At least, I've never seen it mentioned, and it would surprise me if btrfs 
quotas /did/ use that interface, because they are different enough to be 
unlikely to properly match the expected interface API.

Meanwhile, be aware that until recently btrfs quotas were too buggy to be 
used reliably.  While they work rather better now, more minor fixes are 
still being made, with every recent kernel including 4.14 having quota 
fixes.  For this feature a current kernel is definitely recommended, and 
4.10 is neither an LTS kernel series (4.9 and 4.14 are the two most 
recent and best supported LTS series, 4.10 was simply a normal kernel and 
only had upstream support thru 4.11) nor within the latest two current 
kernels, so on-list support won't be as good as if you were running an LTS 
or current kernel.

And even with the fixes, enabling quotas increases btrfs scaling issues 
when running commands such as btrfs balance and check, tho normal runtime 
performance isn't so severely affected.  Balance in particular takes /
much/ longer when quotas are enabled due to constant quota updates as the 
blockgroups are moved around, so temporarily disabling them during 
balances is recommended to speed up the balance.  Unfortunately, the 
scenarios under which you're likely to need to run check, when the 
filesystem won't mount, prevent disabling quotas then.
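
A rough sketch of that disable-balance-reenable dance (mount point and 
the usage filter are examples):

  btrfs quota disable /mnt
  btrfs balance start -dusage=75 /mnt
  btrfs quota enable /mnt
  btrfs quota rescan -w /mnt   # rebuild the qgroup numbers; -w waits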

So while quota numbers should be reliable with supported kernels now, 
leaving them off unless you really need the feature is still recommended.

The one good thing for use-cases that /do/ need quotas is that such use-
cases tend to be commercial systems with proper backups, where the 
performance of commands such as balance and check may not matter so much, 
since maintenance in such use-cases often consists of failing the entire 
filesystem and falling back to the hot-spare, rather than trying to do on-
the-fly filesystem maintenance such as rebalancing to a new device or 
raid layout or checking and trying to repair a filesystem that won't 
mount.  Since normal runtime performance isn't particularly affected, 
quotas tend to be fine for such use-cases.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-01 Thread Duncan
Stirling Westrup posted on Mon, 01 Jan 2018 14:44:43 -0500 as excerpted:

> In hind sight (which is always 20/20), I should have updated the backups
> before starting to make my changes, but as I'd just added a new 4T drive
> to the BTRFS RAID6 in my backup system a week before, and it went as
> smooth as butter, I guess I was feeling insufficiently paranoid.

Are you aware of btrfs raid56-mode history?

If you're running a current enough kernel (the wiki says 4.12 for raid56 
mode, but you might want 4.14 for other fixes and/or the fact that it's 
LTS), the severest known raid56 issues that had it recommendation-
blacklisted are fixed, but raid56 mode still doesn't have fixes for the 
infamous parity-raid write hole, and parities are not checksummed, in 
hindsight an implementation mistake as it breaks btrfs' otherwise solid 
integrity and checksumming guarantees; fixing that is going to require 
an on-disk format change and some major work.

If you're running at least kernel 4.12 and are aware of and understand 
the remaining raid56 caveats, raid56 mode can be a valid choice, but if 
not, I strongly recommend doing more research to learn and understand 
those caveats, before relying too heavily on that backup.

The most reliable and well-tested btrfs multi-device mode remains raid1, 
tho that's expensive in terms of space required since it duplicates 
everything.  For many devices, the recommendation seems to remain btrfs 
raid1, either straight, or on top of a pair of mdraid0s (or the like: 
dmraid0s, hardware raid0s, etc), since that performs better than btrfs 
raid10, and removes a confusing, tho not harmful if properly understood, 
layout ambiguity of btrfs raid10 as well.
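
Purely as a sketch of that layered layout (device names are examples, 
not a recommendation for your particular hardware):

  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
  mkfs.btrfs -m raid1 -d raid1 /dev/md0 /dev/md1

Btrfs then keeps one copy per mdraid0, so checksum errors can still be 
repaired from the other copy.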

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: A Big Thank You, and some Notes on Current Recovery Tools.

2017-12-31 Thread Duncan
[...] Given the above, even keeping 
additional superblock copies on all the other devices isn't necessarily 
going to help much, particularly when it's all similar devices, 
presumably with similar firmware and media weak-points.

But other-device superblocks very well could have helped in a situation 
like yours, where there were two different device sizes and potentially 
brands...
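
For anyone following along, the existing tools for working with the 
superblock copies btrfs /does/ keep, the several copies on each 
individual device, are (device path being an example):

  btrfs inspect-internal dump-super -a /dev/sdX   # show all super copies
  btrfs rescue super-recover -v /dev/sdX          # restore a bad primary

Neither helps when every copy on a device is damaged, which is exactly 
the cross-device gap being discussed here.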

> Anyway, I hope no one takes these criticisms the wrong way. I'm a huge
> fan of BTRFS and its potential, and I know its still early days for the
> code base, and it's yet to fully mature in its recovery and diagnostic
> tools. I'm just hoping that these points can contribute in some small
> way and give back some of the help I got in fixing my system!

I believe you've very likely done just that. =:^)

And even if your case doesn't result in tools to automate superblock 
restoration in cases such as yours in the immediate to near-term (say to 
three years out), it has very definitely already resulted in regulars 
that now have experience with the problem and should now find it /much/ 
easier to tackle a similar problem the next time it comes up!  And as you 
say, it almost certainly /will/ come up again, because it's not /that/ 
unreasonable or uncommon a situation to find oneself in, after all!

But definitely, the best-case would be if it results in the tools 
learning how to automate the process, so people that have no clue what a 
hex editor even is can still have at least /some/ chance of recovering 
from it.  We're just lucky here that someone with the technical skill, 
and just as importantly the time/motivation/determination to either get 
a fix or know exactly why it /could-not/ be fixed, happened to have the 
problem, rather than someone more like me, who /might/ have the 
technical skill, but would be far more likely to just accept the damage 
as reality and fall back to the backups such as they are, than to 
actually invest the time in either getting that fix or knowing for sure 
that it /can't/ be fixed.

A signature I've seen comes to mind, something about the unreasonable 
man refusing to accept reality, thereby making his own, and /thereby/ 
changing it for the good, for everyone, so that progress depends on the 
unreasonable man. =:^)

Yes, I suppose I /did/ just call you "unreasonable", but that's a rather 
extreme compliment, in this case! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs balance problems

2017-12-27 Thread Duncan
[...] balances/scrubs/etc normally take under a minute, so I use the no-
backgrounding option where necessary and normally wait for it to 
complete, tho I sometimes switch to doing something else for a minute or 
so in the meantime.  Tho of course if something goes really wrong, like 
an ssd failing, I'll have multiple btrfs to deal with, as I have it 
partitioned up, with multiple pair-device btrfs using a partition on it 
for one device of their pair.

[2] Balance and check reflink costs:  Some people just bite the bullet 
and don't worry about balance and check times because with their use-
cases, falling back to backup and redoing the filesystem from scratch is 
simpler/faster and more reliable than trying to balance to a different 
btrfs layout or check their way out of trouble.

[3] Ionicing btrfs balance kernel worker threads:  Simplest would be to 
have balance take io-priority parameters to hand to the kernel btrfs 
worker threads it kicks off, like scrub apparently does.  Lacking that, 
I can envision some daemon watching for such threads and ionicing them 
as it finds them.  But that's way more complicated than simply feeding 
the options to the btrfs balance commandline, as can already be done 
with scrub, and with a bit of luck, especially because you /are/ after 
all already running ssd, it /may/ be unnecessary once the above 
suggestions are taken into account.
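
For reference, scrub's existing knob of that sort looks like this (mount 
point being an example); balance has no equivalent yet:

  btrfs scrub start -B -c 3 /mnt   # -c 3 puts the scrub io in the idle class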

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Unexpected raid1 behaviour

2017-12-22 Thread Duncan
Tomasz Pala posted on Sat, 23 Dec 2017 03:52:47 +0100 as excerpted:

> On Fri, Dec 22, 2017 at 14:04:43 -0700, Chris Murphy wrote:
> 
>> I'm pretty sure degraded boot timeout policy is handled by dracut. The
> 
> Well, last time I've checked dracut on systemd-system couldn't even
> generate systemd-less image.

??

Unless it changed recently (I /chose/ a systemd-based dracut setup here, 
so I'd not be aware if it did), dracut can indeed do systemd-less initr* 
images.  Dracut is modular, and systemd is one of the modules, enabled 
by default on a systemd system but not required, as I know because I had 
dracut set up without the systemd module for some time after I switched 
to systemd for my main sysinit, and I verified it didn't install systemd 
in the initr* until I activated the systemd module.
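
The relevant dracut switch, in case it helps anyone reading along (image 
path and kernel version being examples):

  dracut --omit "systemd" /boot/initramfs-nosystemd.img $(uname -r)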

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Unexpected raid1 behaviour

2017-12-22 Thread Duncan
Tomasz Pala posted on Sat, 23 Dec 2017 05:08:16 +0100 as excerpted:

> On Tue, Dec 19, 2017 at 17:08:28 -0700, Chris Murphy wrote:
> 
>>>>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>>>>> safely add "degraded" to the mount options? My primary concern is
>>>>> the
>>> [...]
>> 
>> Well it only does rw once, then the next degraded is ro - there are
>> patches dealing with this better but I don't know the state. And
>> there's no resync code that I'm aware of, absolutely it's not good
>> enough to just kick off a full scrub - that has huge performance
>> implications and I'd consider it a regression compared to functionality
>> in LVM and mdadm RAID by default with the write intent bitmap.  Without
>> some equivalent short cut, automatic degraded means a
> 
> I read about the 'scrub' all over the time here, so let me ask this
> directly, as this is also not documented clearly:
> 
> 1. is the full scrub required after ANY desync? (like: degraded mount
> followed by readding old device)?

It is very strongly recommended.

> 2. if the scrub is omitted - is it possible that btrfs return invalid
> data (from the desynced and readded drive)?

Were invalid data returned it would be a bug.  However, a reasonably 
common refrain here is that btrfs is "still stabilizing, not yet fully 
stable and mature", so occasional bugs can be expected, tho both the 
ideal and experience suggests that they're gradually reducing in 
frequency and severity as time goes on and we get closer to "fully stable 
and mature".

Which of course is why both having usable and tested backups, and keeping 
current with the kernel, are strongly recommended as well, the first in 
case one of those bugs does hit and it's severe enough to take out your 
working btrfs, the second because later kernels have fewer known bugs in 
the first place.

Functioning as designed and as intent-coded, in the case of a desync, 
btrfs will use the copy with the latest generation/transid serial, and 
thus should never return older data from the desynced device.  Further, 
btrfs is designed to be self-healing and will transparently rewrite the 
out-of-sync copy, syncing it in the process, as it comes across each 
stale block.

But the only way to be sure everything's consistent again is that scrub, 
and of course if something should happen to the only current copy while 
the desync still has the other copy stale, /then/ you lose data.

And as I said, that's functioning as designed and intent-coded, assuming 
no bugs, an explicitly unsafe assumption given btrfs' "still stabilizing" 
state.

So... "strongly recommended" indeed, tho in theory it shouldn't be 
absolutely required as long as unlucky fate doesn't strike before the 
data is transparently synced in normal usage.  YMMV, but I definitely do 
those scrubs here.
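
In practice that resync-after-degraded dance is just (device and mount 
point being examples):

  mount -o degraded /dev/sdX2 /mnt   # run degraded while a device is missing
  # ...then once the missing device is back and the fs mounts normally:
  btrfs scrub start -B /mnt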

> 3. is the scrub required to be scheduled on regular basis? By 'required'
> I mean by design/implementation issues/quirks, _not_ related to possible
> hardware malfunctions.

Perhaps I'm tempting fate, but I don't do scheduled/regular scrubs here.  
Only if I have an ungraceful shutdown or see complaints in the log (which 
I tail to a system status dashboard so I'd be likely to notice a problem 
one way or the other pretty quickly).

But I do keep those backups, and while it has been quite some time (over 
a year, I'd say about 18 months to two years, and I was actually able to 
use btrfs restore and avoid having to use the backups themselves the last 
time it happened even 18 months or whatever ago) now since I had to use 
them, I /did/ actually spend some significant money upgrading my backups 
to all-SSD in order to make updating those backups easier and encourage 
me to keep them much more current than I had been (btrfs restore saved me 
more trouble than I'm comfortable admitting, given that I /did/ have 
backups, but they weren't the freshest at the time).

If as some people I had my backups offsite and would have to download 
them if I actually needed them, I'd potentially be rather stricter and 
schedule regular scrubs.

So by design and intention-coding, no, regularly scheduled scrubs aren't 
"required".  But I'd treat them the same as I would on non-btrfs raid, or 
a bit stricter given the above discussed btrfs stability status.  If 
you'd be uncomfortable not scheduling regular scrubs on your non-btrfs 
raid, you better be uncomfortable not scheduling them on btrfs as well!

And as always, btrfs or no btrfs, scrub or no scrub, have your backups or 
you are literally defining your data as not worth the time/trouble/
resources necessary to do them, and some day, maybe 10 minutes from now, 
maybe 10 years from now, fate's going to call you on that definition!

(Yes, I know /you/ know that or we'd not have this thread, [...]

Re: kernel hangs during balance

2017-12-20 Thread Duncan
Holger Hoffstätte posted on Wed, 20 Dec 2017 20:58:14 +0100 as excerpted:

> On 12/20/17 20:02, Chris Murphy wrote:
>> I don't know if it's the sending MUA or the list server, but the line
>> wrapping makes this much harder to follow. I suggest putting it in a
>> text file and attaching the text file. It's definitely not on the
>> receiving side, I see it here also:
>> https://www.spinics.net/lists/linux-btrfs/msg72872.html
> 
> You can see enough to suggest that blk-mq is hanging, which is
> "unsurprising" (being kind here) with such an old kernel. Rich, build
> your kernel with CONFIG_SCSI_MQ_DEFAULT=n or boot with
> scsi_mod.use_blk_mq=n as kernel parameter.

Well, the kernel is 4.4 "elrepo".  4.4 /is/ an LTS (and elrepo is AFAIK 
a Red Hat backports repo, not sure how official, but useful for people 
on Red Hat), but it's now the second-back LTS, with 4.9 and the new 4.14 
being newer LTS series.

The thing is that this is the btrfs list, and we're development and thus 
rather forward focused here.  As such, we normally want at /least/ the 
second newest (first back) lts series kernel for best chance at 
reasonable support.  While I understand people who want to stick with LTS 
being reluctant to go with the /newest/ LTS before even a single current 
release has passed, making that LTS still a bit new as such things go, 
certainly the one-back LTS, 4.9, should be reasonable.

So yes, tho 4.4 is at least an LTS, for purposes of btrfs and this list, 
it really is rather old now, and an upgrade, presumably to 4.9 in keeping 
with the LTS theme, would be recommended.  If the issue can be confirmed 
on the current and LTS 4.14, so much the better, but certainly, 4.9 is a 
recommended upgrade, and a bug there would still be within the range of 
concern for a fix, while 4.4... really is just out of the focus range for this 
list, tho various longer focus distros will of course still provide 
support for it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Unexpected raid1 behaviour

2017-12-20 Thread Duncan
Austin S. Hemmelgarn posted on Wed, 20 Dec 2017 08:33:03 -0500 as
excerpted:

>> The obvious answer is: do it via kernel command line, just like mdadm
>> does:
>> rootflags=device=/dev/sda,device=/dev/sdb
>> rootflags=device=/dev/sda,device=missing
>> rootflags=device=/dev/sda,device=/dev/sdb,degraded
>> 
>> If only btrfs.ko recognized this, kernel would be able to assemble
>> multivolume btrfs itself. Not only this would allow automated degraded
>> mounts, it would also allow using initrd-less kernels on such volumes.
> Last I checked, the 'device=' options work on upstream kernels just
> fine, though I've never tried the degraded option.  Of course, I'm also
> not using systemd, so it may be some interaction with systemd that's
> causing them to not work (and yes, I understand that I'm inclined to
> blame systemd most of the time based on significant past experience with
> systemd creating issues that never existed before).

Has the bug been fixed where rootflags=device=/dev/sda1,device=/dev/sdb1 
failed?  Last I knew (which was ancient history in btrfs terms, but 
I've not seen mention of a patch for it in all that time either), device= 
on the userspace commandline worked, and device= on the kernel commandline 
worked if there was just one device, but it would fail for more than one 
device.  Mounting degraded (on a pair-device raid1) would then of course 
work, since it would just use the one device=, but that's simply 
dangerous for routine use regardless of whether it actually assembled or 
not, thus effectively forcing an initr* for multi-device btrfs root in 
order to get it mounted properly.
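
For anyone wanting to test, the sort of kernel commandline in question 
(device names being examples) looks like:

  root=/dev/sda2 rootflags=device=/dev/sda2,device=/dev/sdb2 rootfstype=btrfs

...with the initr* route instead simply running "btrfs device scan" 
before mounting root, which is what makes multi-device btrfs root work 
reliably today.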

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



  1   2   3   4   5   6   7   8   9   10   >