Re: applications hang on a btrfs spanning two partitions
Marc Joliet posted on Tue, 15 Jan 2019 23:40:18 +0100 as excerpted:

> Am Dienstag, 15. Januar 2019, 09:33:40 CET schrieb Duncan:
>> Marc Joliet posted on Mon, 14 Jan 2019 12:35:05 +0100 as excerpted:
>> > Am Montag, 14. Januar 2019, 06:49:58 CET schrieb Duncan:
>> >
>> >> ... noatime ...
>> >
>> > The one reason I decided to remove noatime from my systems' mount
>> > options is because I use systemd-tmpfiles to clean up cache
>> > directories, for which it is necessary to leave atime intact
>> > (since caches are often Write Once Read Many).
>>
>> Thanks for the reply.  I hadn't really thought of that use, but it
>> makes sense...

I really enjoy these "tips" subthreads.  As I said, I hadn't really
thought of that use, and seeing and understanding other people's
solutions helps when I later find reason to review/change my own. =:^)

One example is an ssd brand reliability discussion from a couple of years
ago.  I had the main system on ssds then and wasn't planning on an
immediate upgrade, but later on, I got tired of the media partition and a
main system backup being on slow spinning rust, and dug out that ssd
discussion to help me decide what to buy.  (Samsung 1 TB evo 850s, FWIW.)

> Specifically, I mean ~/.cache/ (plus a separate entry for
> ~/.cache/thumbnails/, since I want thumbnails to live longer):

Here, ~/.cache -> tmp/cache/ and ~/tmp -> /tmp/tmp-$USER/, plus
XDG_CACHE_HOME=$HOME/tmp/cache/, with /tmp being tmpfs.  So as I said,
user cache is on tmpfs.

Thumbnails... I actually did an experiment with the .thumbnails backed up
elsewhere and empty, and found that, with my ssds anyway, rethumbnailing
was close enough to having them cached that it didn't really matter to my
visual browsing experience.  So not only do I not mind thumbnails being
on tmpfs, I actually have gwenview, my primary image browser, set to
delete its thumbnails dir on close.
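A setup like Marc describes could be expressed as a user-level tmpfiles.d snippet along these lines (the filename, ages, and paths here are illustrative guesses, not his actual configuration):

```
# ~/.config/user-tmpfiles.d/cache.conf -- hypothetical example
# Type  Path                   Mode  User  Group  Age
e       %h/.cache              -     -     -      30d
e       %h/.cache/thumbnails   -     -     -      180d
```

By default systemd-tmpfiles considers a file's atime (among other timestamps) when deciding whether it exceeds the Age, which is why frequently read cache entries only survive cleanup if atime updates are left enabled -- exactly Marc's point about noatime.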
> I haven't bothered configuring /var/cache/, other than making it a
> subvolume so it's not a part of my snapshots (overriding the systemd
> default of creating it as a directory).  It appears to me that it's
> managed just fine by pre-existing tmpfiles.d snippets and by the
> applications that use it cleaning up after themselves (except for
> portage, see below).

Here, /var/cache/ is on /, which remains mounted read-only by default.
The only things using it are package-updates related, and I obviously
have to mount / rw for package updates, so it works fine.  (My sync
script mounts the dedicated packages filesystem containing the repos,
ccache, distdir, and binpkgs, and remounts / rw, and that's the first
thing I run when doing an update, so I don't even have to worry about
doing the mounts manually.)

>> FWIW systemd here too, but I suppose it depends on what's being cached
>> and particularly on the expense of recreation of cached data.  I
>> actually have many of my caches (user/browser caches, etc) on tmpfs and
>> reboot several times a week, so much of the cached data is only
>> trivially cached as it's trivial to recreate/redownload.
>
> While that sort of tmpfs hackery is definitely cool, my system is,
> despite its age, fast enough for me that I don't want to bother with
> that (plus I like my 8 GB of RAM to be used just for applications and
> whatever Linux decides to cache in RAM).  Also, modern SSDs live long
> enough that I'm not worried about wearing them out through my daily
> usage (which IIRC was a major reason for you to do things that way).

16 gigs RAM here, and except for building chromium (in tmpfs), I seldom
fill it even with cache -- most of the time several gigs remain entirely
empty.  With 8 gig I'd obviously have to worry a bit more about what I
put in tmpfs, but given that I have the RAM space, I might as well use it.
When I set up this system I was upgrading from a 4-core (original
2-socket dual-core 3-digit Opterons, purchased in 2003 and ran until the
caps started dying in 2011), this system being a 6-core fx-series, and
based on the experience with the quad-core, I figured 12 gig RAM for the
6-core.  But with pairs of RAM sticks for dual-channel, powers of two
worked better, so it was 8 gig or 16 gig.  And given that I had worked
with 8 gig on the quad-core, I knew that would be OK, but 12 gig would
mean less cache dumping, so 16 gig it was.

And my estimate was right on.  Since 2011, I've typically run up to ~12
gigs RAM used including cache, leaving ~4 gigs of the 16 entirely unused
most of the time, tho I do use the full 16 gig sometimes when doing
updates, since I have PORTAGE_TMPDIR set to tmpfs.  Of course since my
purchase in 2011 I've upgraded to SSDs, and RAM-based storage cache isn't
as important.
Re: applications hang on a btrfs spanning two partitions
Marc Joliet posted on Mon, 14 Jan 2019 12:35:05 +0100 as excerpted:

> Am Montag, 14. Januar 2019, 06:49:58 CET schrieb Duncan:
> [...]
>> Unless you have a known reason not to[1], running noatime with btrfs
>> instead of the kernel-default relatime is strongly recommended,
>> especially if you use btrfs snapshotting on the filesystem.
> [...]
>
> The one reason I decided to remove noatime from my systems' mount
> options is because I use systemd-tmpfiles to clean up cache directories,
> for which it is necessary to leave atime intact (since caches are often
> Write Once Read Many).

Thanks for the reply.  I hadn't really thought of that use, but it makes
sense...

FWIW systemd here too, but I suppose it depends on what's being cached
and particularly on the expense of recreation of cached data.  I
actually have many of my caches (user/browser caches, etc) on tmpfs and
reboot several times a week, so much of the cached data is only
trivially cached as it's trivial to recreate/redownload.

OTOH, running gentoo, my ccache and binpkg cache are seriously CPU-cycle
expensive to recreate, so you can bet those are _not_ tmpfs, but OTTH,
they're not managed by systemd-tmpfiles either.  (Ccache manages its own
cache, and together with the source-tarballs cache and git-managed repo
trees along with binpkgs, I have a dedicated packages btrfs containing
all of them, so I eclean binpkgs and distfiles whenever the 24 gigs of
space (48-gig total, 24-gig each on pair-device btrfs raid1) gets too
close to full, then btrfs balance with -dusage= to reclaim partial
chunks to unallocated.)

Anyway, if you're not regularly snapshotting, relatime is reasonably
fine, tho I'd still keep the atime effects in mind and switch to noatime
if you end up in a recovery situation that requires writable mounting.
(Losing a device in btrfs raid1 and mounting writable in order to
replace it and rebalance comes to mind as one example of a
writable-mount recovery scenario, where noatime until full
replace/rebalance/scrub completion would prevent unnecessary writes
until the raid1 is safely complete and scrub-verified again.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
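For reference, the switch to noatime is a single mount option.  The fstab entry below is a made-up example (the UUID is a placeholder), and the remount applies the change to a live filesystem without rebooting:

```
# /etc/fstab -- hypothetical entry; substitute your own UUID:
UUID=0123abcd-placeholder  /  btrfs  noatime  0 0

# Or, on an already-mounted filesystem:
#   mount -o remount,noatime /
```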
Re: applications hang on a btrfs spanning two partitions
Florian Stecker posted on Sat, 12 Jan 2019 11:19:14 +0100 as excerpted:

> $ mount | grep btrfs
> /dev/sda8 on / type btrfs
> (rw,relatime,ssd,space_cache,subvolid=5,subvol=/)

Unlikely to be apropos to the problem at hand, but FYI...

Unless you have a known reason not to[1], running noatime with btrfs
instead of the kernel-default relatime is strongly recommended,
especially if you use btrfs snapshotting on the filesystem.

The reasoning is that even tho relatime reduces the default access-time
updates to once a day, it still likely-unnecessarily turns otherwise
read-only operations into read-write operations, and atimes are
metadata, which btrfs always COWs (copy-on-writes), meaning atime
updates can trigger cascading metadata block-writes and much larger than
anticipated[2] write-amplification, potentially hurting performance,
yes, even for relatime, depending on your usage.

In addition, if you're using snapshotting and not using noatime, it can
easily happen that a large portion of the change between one snapshot
and the next is simply atime updates, thus making the space referenced
exclusively by individual affected snapshots far larger than it would
otherwise be.

---
[1] mutt is AFAIK the only widely used application that still depends on
atime updates, and it only does so in certain modes, not with
mbox-format mailboxes, for instance.  So unless you're using it, or your
backup solution happens to use atime, chances are quite high that
noatime won't disrupt your usage at all.

[2] Larger than anticipated write-amplification: Especially when you
/thought/ you were only reading the files and hadn't considered the
atime update that read could trigger, thus effectively generating
infinite write amplification, because the read access did an atime
update and turned what otherwise wouldn't be a write operation at all
into one!
Re: Balance of Raid1 pool, does not balance properly.
Karsten Vinding posted on Tue, 08 Jan 2019 20:40:12 +0100 as excerpted:

> Hello.
>
> I have a Raid1 pool consisting of 6 drives, 3 3TB disks and 3 2TB disks.
>
> Until yesterday it consisted of 3 2TB disks, 2 3TB disks and one 1TB
> disk.
>
> I replaced the 1TB disk as the pool was close to full.
>
> Replacement went well, and I ended up with 5 almost full disks, and 1
> 3TB disk that was one third full.
>
> So I kicked off a balance, expecting it to balance the data as evenly as
> possible on the 6 disks (btrfs balance start poolname).
>
> The balance ran fine but I ended up with this:
>
> Total devices 6 FS bytes used 5.66TiB
> devid 9 size 2.73TiB used 2.69TiB path /dev/sdf
> devid 10 size 1.82TiB used 1.78TiB path /dev/sdb
> devid 11 size 1.82TiB used 1.73TiB path /dev/sdc
> devid 12 size 1.82TiB used 1.73TiB path /dev/sdd
> devid 13 size 2.73TiB used 2.65TiB path /dev/sde
> devid 15 size 2.73TiB used 817.87GiB path /dev/sdg
>
> The sixth drive, sdg, is still only one third full.
>
> How do I force BTRFS to distribute the data more evenly across the
> disks?
>
> The way BTRFS has done it now will bring problems when I write more
> data to the array.

After doing the btrfs replace to the larger device, did you resize to
the full size of the larger device, as noted in the btrfs-replace
manpage (but before you do, please post btrfs device usage from before,
and then again after the resize, as below)?

I ask because that's an easy-to-forget step that you don't specifically
mention doing.  If you didn't, that's your problem -- the filesystem on
that device is still the size of the old device, and needs to be resized
to the larger size of the new one, after which a balance should work as
expected.

Note that there is a very recently reported bug in the way btrfs
filesystem usage reports the size in this case, adding the device slack
to unallocated, altho it can't actually be allocated by the filesystem
at all, as the filesystem size doesn't cover that space on that device.
I thought the bug didn't extend to show, which would indicate that you
did the resize and just didn't mention it, but am asking as that's
otherwise the most likely reason for the listed behavior.

I /believe/ btrfs device usage indicates the extra space in its device
slack line, but the reporter had already increased the size by the time
of posting and hadn't run btrfs device usage previous to that, and it
was non-dev list regulars in the discussion that didn't know for sure
and didn't have a replaced and as-yet-unresized-filesystem device to
check, so we haven't actually verified whether it displays correctly or
not yet.

Thus the request for the btrfs device usage output, to verify all that
for both your case and the previous similar thread...
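If the resize was indeed skipped, the fix being suggested amounts to the following sequence (devid 15 is taken from the listing above; the mount point is a placeholder):

```
# Per-device allocation, including any "Device slack", before resizing:
btrfs device usage /mnt/pool

# Grow the filesystem on the replaced device (devid 15) to fill it:
btrfs filesystem resize 15:max /mnt/pool

# Confirm the slack is gone, then rebalance across all six devices:
btrfs device usage /mnt/pool
btrfs balance start /mnt/pool
```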
Re: Undelete files
Jesse Emeth posted on Sun, 30 Dec 2018 16:58:12 +0800 as excerpted:

> Hi Duncan
>
> The backup is irrelevant in this case.  I have a backup of this
> particular problem.
> I've had BTRFS on my OS system blow up several times.
> There are several snapshots of this within the subvolume.
> However, such snapshots are not helpful unless they are snapshots
> copied elsewhere with restore/rsync etc.

How can backups and snapshots not be helpful in terms of a problem where
you'd be using undelete?  Undelete implies the filesystem is fine and
that you're just trying to get a few files that you mistakenly deleted
back, which in fact was the claim, and both backups and snapshots should
allow you to do just that, get your deleted files back.

> I had spoken to someone expressing my concerns with BTRFS on IRC.
> He wanted me to present this so that such problems could be rectified.
> I also wanted to learn more about BTRFS to see if my determinations
> about its inadequacies were incorrect.
>
> Thus I want to follow this through to see if what is actually a very
> very small problem related to just a non-essential small Firefox cache
> directory can actually be fixed.
> At present this very very small problem brings down the entire volume
> and all subvolumes with no way to mount any of it rw or easily fix the
> issue.
> That is not sane for such a small issue.

That's not a file undelete issue.  That's an entire filesystem issue.
Quite a different beast, and not one that I directly addressed in my
reply (altho the data value vs. backups stuff applies to fat-fingering
such as mistaken deletes, filesystem problems, hardware problems, and
natural disasters, all four), because both the title and the content
suggested a file undelete issue, which /was/ addressed.
Re: Checksum errors
,000+ (calculated from the raw used value against the percentage value
for cooked; I didn't have all the different ways of reporting it on mine
that you have), so it took me quite a while to work thru them, even tho
I was chewing them up rather regularly toward the end, sometimes several
hundred at a time.

But while the "cooked" values are standardized to 253 (254/255 are
reserved) or sometimes 100 (percentage) maximum, the raw values differ
between manufacturers.  I'm pretty sure mine (Corsair Neutron brand)
were the number of 512-byte sectors, so a couple K per MB, and I had
tens of MB of reserve, thus explaining the 5-digit raw used numbers
while still saying 80+ percent good cooked, but yours may be counting in
2 MiB erase-blocks or some such, thus the far lower raw numbers.  Or
perhaps Samsung simply recognized that such huge numbers of reserve
blocks weren't particularly practical, people replaced the drive before
it got /that/ bad, and put those would-be reserves to higher usable
capacity instead.

Regardless, while the ssd may continue to be usable as cache for some
time, I'd strongly suggest rotating it out of normal use for anything
you value, or at LEAST increasing your number of backups and/or pairing
it with something else in btrfs raid1, as I already had mine when I
noticed it going bad, so I could continue to use it and watch it degrade
over time.  I'd definitely *NOT* recommend trusting that ssd in single
or raid0 mode, for anything of value that's not backed up, period.
Whatever those raw events are measuring, 50% on the cooked value is
waaayyy too low to continue to trust it, tho as a cache device or
similar, where a block going out occasionally isn't a big deal, it may
continue to be useful for years.
FWIW, with my tens of thousands of reserve blocks and the device in
btrfs raid1 with a known good device, I was able to use routine btrfs
scrubs to clean up the damage for quite some time, IIRC 8 months or so,
until it just got so bad I was doing scrubs and finding and correcting
sometimes hundreds of errors on every reboot, and as I actually had a
third ssd I had planned to put in something else and never did get it
there, I finally decided I had had enough, and after one final scrub, I
did a btrfs replace of the old device with the new one.

But AFAIK it had only gotten down to 85 cooked value or so, even then.
And there's no way I'd have considered the ssd usable at anything under
say 92 cooked, as blocks were simply erroring out too often, had I not
had btrfs raid1 mode and been able to scrub away the errors.

Meanwhile, FWIW, the other devices, both the good one of the original
pair and the replacement for the bad one (same make and model as the bad
one), are still going today.  One of them has a
5/reallocated-sector-count raw value of 17, still 100% on the cooked
value; the other says 0 raw / 253 cooked.  (For many values including
this one, a cooked value of 253 means entirely clean, with a single
"event" it drops to 100%, and it goes from there based on calculated
percentage.)
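The scrub-and-monitor routine described above boils down to a couple of commands (the device and mount point names below are placeholders):

```
# Scrub the filesystem; with btrfs raid1, blocks failing checksum are
# rewritten from the good mirror (-B: foreground, -d: per-device stats):
btrfs scrub start -Bd /mnt/data
btrfs scrub status /mnt/data

# Watch the SMART reallocated-sector attribute -- both the raw count and
# the normalized ("cooked") value discussed above:
smartctl -A /dev/sda
```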
Re: Undelete files
Duncan posted on Sun, 30 Dec 2018 04:11:20 +0000 as excerpted:

> Adrian Bastholm posted on Sat, 29 Dec 2018 23:22:46 +0100 as excerpted:
>
>> Hello all,
>> Is it possible to undelete files on BTRFS ?  I just deleted a bunch of
>> folders and would like to restore them if possible.
>>
>> I found this script
>> https://gist.github.com/Changaco/45f8d171027ea2655d74 but it's not
>> finding stuff ..
>
> That's an undelete-automation wrapper around btrfs restore...
>
>> ./btrfs-undelete /dev/sde1 ./foto /home/storage/BTRFS_RESTORE/
>> Searching roots...
>> Trying root 389562368... (1/70)
>> ...
>> Trying root 37339136... (69/70)
>> Trying root 30408704... (70/70)
>> Didn't find './foto'
>
> That script is the closest thing to a direct undelete command that btrfs
> has.  However, there's still some chance...
>
> ** IMPORTANT ** If you still have the filesystem mounted read-write,
> remount it read-only **IMMEDIATELY**, because every write reduces your
> chance at recovering any of the deleted files.
>
> (More in another reply, but I want to get this sent with the above
> ASAP.)

First a question: Any chance you have a btrfs snapshot of the deleted
files you can mount and recover from?  What about backups?

Note that a number of distros using btrfs have automated snapshotting
setup, so it's possible you have a snapshot with the files safely
available and don't even know it.  Thus the snapshotting question (more
on backups below).  It could be worth checking...

Assuming no snapshot and no backup with those files...

Disclaimer: I'm not a dev, just a btrfs user and list regular myself.
Thus, the level of direct technical help I can give is limited, and much
of what remains is more what to do differently to prevent a next time,
tho there's some additional hints about the current situation further
down...

Well, the first thing in this case to note is the sysadmin's (yes,
that's you...
and me, and likely everyone here: [1]) first rule of backups: The true
value of data isn't defined by any arbitrary claims, but by the number
of backups of that data it is considered valuable enough to have.

Thus, in the most literal way possible, not having a backup is simply
defining the data as not worth the time/trouble/hassle to make one, and
not having a second and third and... backup is likewise simply defining
the value of the data as not worth that one more level of backup.

(Likewise, not having an /updated/ backup is simply defining the value
of data in the delta between the current working copy and the last
backup as of trivial value, because as soon as it's worth more than the
time/trouble/resources required to update the backup, by definition, the
backup will be updated in accordance with the value of the data in that
delta.)

Thus, the fact that we're assuming no backup now means that we already
defined the data as of trivial value, not worth the
time/trouble/resources necessary to make even a single backup.

Which means no matter what the loss or why, hardware, software or
"wetware" failure (the latter aka fat-fingering, as here), or even
disaster such as flood or fire, when it comes to our data we can
*always* rest easy, because we *always* save what was of most value:
either the data, if we defined it as such by the backups we had of it,
or the time/trouble/resources that would have otherwise gone into the
backup, if we judged the data to be of lower value than that one more
level of backup.

Which means there's a strict limit to the value of the data possibly
lost, and thus a strict limit to the effort we're likely willing to put
into recovery after that data loss risk factor appears to have evaluated
to 1, before the recovery effort too becomes not worth the trouble.
After all, if it /was/ worth the trouble, it would have also been worth
the trouble to do that backup in the first place, and the fact that we
don't have it means it wasn't worth that trouble.

At least for me, looking at it from this viewpoint significantly lowers
my stress during disaster recovery situations.  There's simply not that
much at risk, nor can there be, even in the event of losing "everything"
(well, data-wise anyway; hardware, or for that matter my life and/or
health, family and friends, etc, unfortunately that's not as easy to
backup as data!) to a fire or the like, since if there was more at risk,
there'd be backups (offsite backups in the fire/flood sort of case) we
could fall back on should it come to that.

That said, before-the-fact it's an unknown risk-factor, while
after-the-fact, that previously unknown risk-factor has evaluated to
100% chance of (at least apparent) data loss!  It's actually rather l
Re: Undelete files
Adrian Bastholm posted on Sat, 29 Dec 2018 23:22:46 +0100 as excerpted:

> Hello all,
> Is it possible to undelete files on BTRFS ?  I just deleted a bunch of
> folders and would like to restore them if possible.
>
> I found this script
> https://gist.github.com/Changaco/45f8d171027ea2655d74 but it's not
> finding stuff ..

That's an undelete-automation wrapper around btrfs restore...

> ./btrfs-undelete /dev/sde1 ./foto /home/storage/BTRFS_RESTORE/
> Searching roots...
> Trying root 389562368... (1/70)
> ...
> Trying root 37339136... (69/70)
> Trying root 30408704... (70/70)
> Didn't find './foto'

That script is the closest thing to a direct undelete command that btrfs
has.  However, there's still some chance...

** IMPORTANT ** If you still have the filesystem mounted read-write,
remount it read-only **IMMEDIATELY**, because every write reduces your
chance at recovering any of the deleted files.

(More in another reply, but I want to get this sent with the above ASAP.)
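Spelled out as commands, the advice above might look like this (the mount point is a guess at wherever /dev/sde1 is mounted; the regex form follows btrfs-restore's documented path-regex syntax for a top-level ./foto directory):

```
# Stop all new writes to the affected filesystem immediately
# (/mnt/photos is a placeholder):
mount -o remount,ro /mnt/photos

# Then attempt recovery by hand, writing to a DIFFERENT filesystem;
# --path-regex restricts the walk to the deleted ./foto tree:
btrfs restore -iv --path-regex '^/(|foto(|/.*))$' /dev/sde1 /home/storage/BTRFS_RESTORE/
```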
Re: Broken chunk tree - Was: Mount issue, mount /dev/sdc2: can't read superblock
Tomáš Metelka posted on Sun, 30 Dec 2018 01:48:23 +0100 as excerpted:

> Ok, I've got it :-(
>
> But just a few questions: I've tried (with btrfs-progs v4.19.1) to
> recover files through btrfs restore -s -m -S -v -i ... and the
> following events occurred:
>
> 1) Just 1 "hard" error:
> ERROR: cannot map block logical 117058830336 length 1073741824: -2
> Error copying data for /mnt/...
> (a file whose absence really doesn't pain me :-))
>
> 2) For 24 files I got the "too much loops" warning (I mean this: "if
> (loops >= 0 && loops++ >= 1024) { ...").  I've always answered yes, but
> I'm afraid these files are corrupted (at least 2 of them seem
> corrupted).
>
> How bad is this?  Does the error mentioned in #1 mean that it's the
> only file which is totally lost?  I can live without those 24 + 1
> files, so if #1 and #2 were the only errors then I could say the
> recovery was successful ... but I'm afraid things aren't that easy :-)

In this context, the biggest thing to know about btrfs restore is that
it doesn't worry too much about checksums.  It's meant as a recovery
measure, and if it comes to using restore, the assumption is that the
priority is recovering /anything/, even if the checksums don't match --
a chance of recovery of possibly bad data is considered better than
rejecting possibly bad data entirely.  (AFAIK it ignores data checksums
entirely; I'm not sure whether it checks metadata checksums or not, but
it probably ignores failures in at least some cases if it does, because
that's the point of a tool like this.)

Which means that anything recovered using btrfs restore doesn't have the
usual btrfs checksum-validation guarantees, and could very possibly be
corrupt.
However, that's mitigated by the fact that most filesystems don't even
have built-in checksumming and validation in the first place, so the
data on them could go bad even in normal operation, and unless it was
obviously corrupted into not working, you'd likely never even know it.
So btrfs restore ignoring checksums simply returns the data to the state
it's /normally/ in on most filesystems: completely unverified.

But if you happen to have checksums independently stored somewhere, or
even just ordinary unvalidated backups you can compare against, and
you're worried about the possibility of undiscovered corruption due to
the restore, and/or you were using btrfs in part /because/ of its
built-in checksum verification, it could be worth doing that
verification run against your old checksum database or backups.
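A minimal sketch of that verification run: hash a trusted backup tree, then check files recovered by btrfs restore against it.  The directory names and file contents here are made-up stand-ins for a real backup and a real restore target.

```shell
# Create toy stand-ins for a trusted backup and a restored tree:
mkdir -p backup restored
printf 'intact file\n' > backup/a.txt
printf 'intact file\n' > restored/a.txt
printf 'original\n'    > backup/b.txt
printf 'silently corrupted\n' > restored/b.txt

# Build a checksum database from the (trusted) backup...
( cd backup && find . -type f -exec sha256sum {} + ) > backup.sums

# ...and verify the restored copies against it; mismatches are flagged:
( cd restored && sha256sum -c ../backup.sums 2>/dev/null; true ) | grep FAILED
```

The final line prints only the files whose restored contents no longer match the backup (here, b.txt), which is exactly the undiscovered corruption the paragraph above is worried about.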
Re: btrfs fi usage -T shows unallocated column as total in drive, not total available to btrfs
Chris Murphy posted on Thu, 27 Dec 2018 16:37:55 -0700 as excerpted:

[Context is btrfs reports when btrfs is smaller on a device than the
device it is on.  In this specific case it's due to btrfs replace to a
larger device, before using btrfs filesystem resize to increase the size
to that of the newer/larger device.]

> OK let me see if I get this right.  You're saying it's confusing that
> 'btrfs fi sh' "devid size" does not change when doing a device replace;
> whereas 'btrfs fi us' device specific "unallocated" does change, even
> though you haven't yet done a resize.
>
> I kinda sorta agree.  While "unallocated" becomes 6.53TiB for this
> device, the idea it's unallocated suggests it could be allocated, which
> before a resize it cannot be allocated.

"It depends what the definition of "unallocated" is."[1]

Arguably, just as "unallocated" includes space not yet allocated to
data/metadata/system chunks, it could be argued it should include space
on the device not yet allocated to the filesystem as well.  Clearly,
that's what the coder of the btrfs filesystem usage functionality
thought.  By that view, "unallocated" includes "not yet allocated to the
filesystem itself also, but available on the block device the filesystem
is on, to be allocated to the filesystem should the admin decide to do
so."

OTOH, as the OP says, it's still confusing, and as pointed out in a
reply, it's btrfs _filesystem_ usage we're talking about here, not btrfs
_device_ usage, and at minimum, _filesystem_ usage including space on
the device that's not yet allocated to that filesystem is indeed
confusing/unintuitive, and arguably actually incorrect, particularly if
the btrfs device usage report reports that space under its "device
slack" line, which as admins we don't actually know at this point (it
doesn't appear to be documented except presumably in the code itself).
And arguably, if btrfs filesystem usage is to report it at all, it
should be under a separate (additional) line, presumably device slack,
if that's what the device usage version does with that line.

---
[1] Quote paraphrases a famous US political/legal quote from some years
ago...  OT as to the merits, but if you wish the background,
s/unallocated/is/ and google it using the search engine of your choice.
Re: btrfs fi usage -T shows unallocated column as total in drive, not total available to btrfs
Chris Murphy posted on Wed, 26 Dec 2018 17:36:19 -0700 as excerpted:

> I'm not really following this.  An fs resize is implied by any device
> add, remove or replace command.  In the case of replace, it will
> efficiently copy the device being replaced to the designated drive, and
> then once that succeeds resize the file system to reflect the size of
> the replacement device.  I'm also confused why devid 4 seems to be
> present before and after your device replace, so I have to wonder if
> your copy paste really worked out as intended?  And also, what version
> of kernel and btrfs-progs are you using?

I thought... yes...  Just checked the btrfs-replace manpage (v4.19.1)
and it says:

    Note the filesystem has to be resized to fully take advantage of a
    larger target device; this can be achieved with
    btrfs filesystem resize <devid>:max /path

So it does *not* auto-resize after the replace.

Also, I'm not positive on this, and I don't see it mentioned in the
manpage, but I /think/ replace (unlike add/remove) keeps the same devid
for the new device.  (And IIRC one of the devs commented that there's a
devid 0 during the replace itself, but I'm unsure whether that's the
source or the destination, that is, whether the old ID is switched to
the new device at the beginning of the replace, so the old one
temporarily gets the 0 during the replace until it's deleted at the end,
or at the end, so the new one temporarily gets it until the id is
transferred.  That was in the context of a draft patch that didn't yet
account for the possibility of devid 0 during replace, and the comment
was pointing out the possibility.)
If that's correct, then the devid 4 could indeed be the old device at
first (when it refers to sda and has 164.5 GiB unallocated), but the new
device later (when it refers to sdu and has 6.53 TiB unallocated), even
before the resize, that being the point of confusion (6.53 TiB
unallocated even tho it can't actually use it, as it hasn't been resized
yet) that triggered the original post in the first place.

To address that point, I suppose ideally there'd be another line when
the filesystem's smaller than the available device size, device-space
outside filesystem, or some such.  Tho you are correct that fi show and
fi df's output don't correspond exactly to fi usage without some sort of
decoder ring to translate between them, and even with the decoder ring,
the numbers mostly come out, but slightly different things are reported.

Meanwhile, while I normally want the filesystem usage info and thus use
that command, for something like this I'd be specifically interested in
the specific device's usage, and thus would use btrfs device usage, in
place of or in addition to btrfs filesystem usage.  It'd be interesting
to see what device usage (as opposed to filesystem usage) did with the
unreachable space in terms of reporting -- maybe it has that separate
line, tho I doubt it, but if not, does it count it or not?  But that
wasn't posted, and presumably the query wasn't run while in the
still-unresized state, and I guess it's a bit late now to get it...
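For illustration, if replace slack really were reported on the "Device slack" line, the per-device report might look something like the sketch below.  This is an invented example built from the numbers in this thread, not captured output, and whether real btrfs-progs reports it this way is exactly the open question above:

```
$ btrfs device usage /mnt    # mount point hypothetical
/dev/sdu, ID: 4              # illustrative sketch, not real output
   Device size:            8.00TiB
   Device slack:           6.53TiB   <- hoped-for reporting of unresized space
   Data,single:            1.20TiB
   Metadata,DUP:           4.00GiB
   Unallocated:          164.50GiB
```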
Re: SATA/SAS mixed pool
Adam Borowski posted on Thu, 13 Dec 2018 08:29:05 +0100 as excerpted:

> On Wed, Dec 12, 2018 at 09:31:02PM -0600, Nathan Dehnel wrote:
>> Is it possible/safe to replace a SATA drive in a btrfs RAID10 pool
>> with an SAS drive?
>
> For btrfs, a block device is a block device, it's not "racist".
> You can freely mix and/or replace.  If you want to, say, extend a SD
> card with NBD to remote spinning rust, it works well -- tested :p

FWIW (mostly for other readers, not so much this particular case), the
known exception/caveat to that is USB block devices, which do tend to
have problems, tho some hardware is fine.
Re: HELP unmountable partition after btrfs balance to RAID0
Thomas Mohr posted on Thu, 06 Dec 2018 12:31:15 +0100 as excerpted: > We wanted to convert a file system to a RAID0 with two partitions. > Unfortunately we had to reboot the server during the balance operation > before it could complete. > > Now following happens: > > A mount attempt of the array fails with following error code: > > btrfs recover yields roughly 1.6 out of 4 TB. [Just another btrfs user and list regular, not a dev. A dev may reply to your specific case, but meanwhile, for next time...] That shouldn't be a problem. Because with raid0 a failure of any of the components will take down the entire raid, making it less reliable than a single device, raid0 (in general, not just btrfs) is considered only useful for data of low enough value that its loss is no big deal, either because it's truly of little value (internet cache being a good example), or because backups are kept available and updated for whenever the raid0 array fails. Because with raid0, it's always a question of when it'll fail, not if. So loss of a filesystem being converted to raid0 isn't a problem, because the data on it, by virtue of being in the process of conversion to raid0, is defined as of throw-away value in any case. If it's of higher value than that, it's not going to be raid0 (or in the process of conversion to it) in the first place. Of course that's simply an extension of the more general first sysadmin's rule of backups, that the true value of data is defined not by arbitrary claims, but by the number of backups of that data it's worth having. Because "things happen", whether it's fat-fingering, bad hardware, buggy software, or simply someone tripping over the power cable or running into the power pole outside at the wrong time. So no backup is simply defining the data as worth less than the time/ trouble/resources necessary to make that backup. 
Note that you ALWAYS save what was of most value to you, either the time/trouble/resources to do the backup, if your actions defined that to be of more value than the data, or the data, if you had that backup, thereby defining the value of the data to be worth backing up. Similarly, failure of the only backup isn't a problem because by virtue of there being only that one backup, the data is defined as not worth having more than one, and likewise, having an outdated backup isn't a problem, because that's simply the special case of defining the data in the delta between the backup time and the present as not (yet) worth the time/hassle/resources to make/refresh that backup. (And FWIW, the second sysadmin's rule of backups is that it's not a backup until you've successfully tested it recoverable in the same sort of conditions you're likely to need to recover it in. Because so many people have /thought/ they had backups, that turned out not to be, because they never tested that they could actually recover the data from them. For instance, if the backup tools you'll need to recover the backup are on the backup itself, how do you get to them? Can you create a filesystem for the new copy of the data and recover it from the backup with just the tools and documentation available from your emergency boot media? Untested backup == no backup, or at best, backup still in process!) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: unable to fixup (regular) error
Alexander Fieroch posted on Mon, 26 Nov 2018 11:23:00 +0100 as excerpted: > Am 26.11.18 um 09:13 schrieb Qu Wenruo: >> The corruption itself looks like some disk error, not some btrfs error >> like transid error. > > You're right! SMART has an increased value for one harddisk on > reallocated sector count. Sorry, I missed to check this first... > > I'll try to salvage my data... FWIW as a general note about raid0 for updating your layout... Because raid0 is less reliable than a single device (failure of any device of the raid0 is likely to take it out, and failure of any one of N is more likely than failure of any specific single device), admins generally consider it useful only for "throw-away" data, that is, data that can be lost without issue either because it really /is/ "throw-away" (internet cache being a common example), or because it is considered a "throw-away" copy of the "real" data stored elsewhere, with that "real" copy being either the real working copy of which the raid0 is simply a faster cache, or with the raid0 being the working copy, but with sufficiently frequent backup updates that if the raid0 goes, it won't take anything of value with it (read as the effort to replace any data lost will be reasonably trivial, likely only a few minutes or hours, at worst perhaps a day's worth, of work, depending on how many people's work is involved and how much their time is considered to be worth). So if it's raid0, you shouldn't be needing to worry about trying to recover what's on it, and probably shouldn't even be trying to run a btrfs check on it at all as it's likely to be more trouble and take more time than the throw-away data on it is worth. If something goes wrong with a raid0, just declare it lost, blow it away and recreate fresh, restoring from the "real" copy if necessary. Because for an admin, really with any data but particularly for a raid0, it's more a matter of when it'll die than if. 
If that's inappropriate for the value of the data and status of the backups/real-copies, then you should really be reconsidering whether raid0 of any sort is appropriate, because it almost certainly is not. For btrfs, what you might try instead of raid0, is raid1 metadata at least, raid0 or single mode data if there's not room enough to do raid1 data as well. And the raid1 metadata would have very likely saved the filesystem in this case, with some loss of files possible depending on where the damage is, but with the second copy of the metadata from the good device being used to fill in for and (attempt to, if the bad device is actively getting worse it might be a losing battle) repair any metadata damage on the bad device, thus giving you a far better chance of saving the filesystem as a whole. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Filesystem mounts fine but hangs on access
Adam Borowski posted on Sun, 04 Nov 2018 20:55:30 +0100 as excerpted: > On Sun, Nov 04, 2018 at 06:29:06PM +0000, Duncan wrote: >> So do consider adding noatime to your mount options if you haven't done >> so already. AFAIK, the only /semi-common/ app that actually uses >> atimes these days is mutt (for read-message tracking), and then not for >> mbox, so you should be safe to at least test turning it off. > > To the contrary, mutt uses atimes only for mbox. Figures that I'd get it reversed. >> And YMMV, but if you do use mutt or something else that uses atimes, >> I'd go so far as to recommend finding an alternative, replacing either >> btrfs (because as I said, relatime is arguably enough on a traditional >> non-COW filesystem) or whatever it is that uses atimes, your call, >> because IMO it really is that big a deal. > > Fortunately, mutt's use could be fixed by teaching it to touch atimes > manually. And that's already done, for both forks (vanilla and > neomutt). Thanks. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Filesystem mounts fine but hangs on access
Sebastian Ochmann posted on Sun, 04 Nov 2018 14:15:55 +0100 as excerpted: > Hello, > > I have a btrfs filesystem on a single encrypted (LUKS) 10 TB drive which > stopped working correctly. > Kernel 4.18.16 (Arch Linux) I see upgrading to 4.19 seems to have solved your problem, but this is more about something I saw in the trace that has me wondering... > [ 368.267315] touch_atime+0xc0/0xe0 Do you have any atime-related mount options set? FWIW, noatime is strongly recommended on btrfs. Now I'm not a dev, just a btrfs user and list regular, and I don't know if that function is called and just does nothing when noatime is set, so you may well already have it set and this is "much ado about nothing", but the chance that it's relevant, if not for you, perhaps for others that may read it, begs for this post... The problem with atime, access time, is that it turns most otherwise read-only operations into read-and-write operations in order to update the access time. And on copy-on-write (COW) based filesystems such as btrfs, that can be a big problem, because updating that tiny bit of metadata will trigger a rewrite of the entire metadata block containing it, which will trigger an update of the metadata for /that/ block in the parent metadata tier... all the way up the metadata tree, ultimately to its root, the filesystem root and the superblocks, at the next commit (normally every 30 seconds or less). Not only is that a bunch of otherwise unnecessary work for a bit of metadata barely anything actually uses, but forcing most read operations to read-write obviously compounds the risk for all of those would-be read-only operations when a filesystem already has problems. 
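[To make the noatime recommendation concrete, here's a sketch of what it looks like in practice -- the UUID and mount point below are hypothetical placeholders, substitute your own:]

```shell
# Hypothetical fstab entry -- noatime goes in the options column:
#
#   UUID=0123abcd-...  /home  btrfs  defaults,noatime  0 0
#
# An already-mounted filesystem can be switched without a reboot:
mount -o remount,noatime /home
```

(On Linux, noatime implies nodiratime, so there's no need to list both.)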
Additionally, if your use-case includes regular snapshotting, with atime on, on mostly read workloads with few writes (other than atime updates), it may actually be the case that most of the changes in a snapshot are actually atime updates, making recurring snapshot updates far larger than they'd be otherwise. Now a few years ago the kernel did change the default to relatime, basically updating the atime for any particular file only once a day, which does help quite a bit, and on traditional filesystems it's arguably a reasonably sane default, but COW makes atime tracking enough more expensive that setting noatime is still strongly recommended on btrfs, particularly if you're doing regular snapshotting. So do consider adding noatime to your mount options if you haven't done so already. AFAIK, the only /semi-common/ app that actually uses atimes these days is mutt (for read-message tracking), and then not for mbox, so you should be safe to at least test turning it off. And YMMV, but if you do use mutt or something else that uses atimes, I'd go so far as to recommend finding an alternative, replacing either btrfs (because as I said, relatime is arguably enough on a traditional non-COW filesystem) or whatever it is that uses atimes, your call, because IMO it really is that big a deal. Meanwhile, particularly after seeing that in the trace, if the 4.19 update hadn't already fixed it, I'd have suggested trying a read-only mount, both as a test, and assuming it worked, at least allowing you to access the data without the lockup, which would have then been related to the write due to the atime update, not the actual read. Actually, a read-only mount test is always a good troubleshooting step when the trouble is a filesystem that either won't mount normally, or will, but then locks up when you try to access something. 
It's far less risky than a normal writable mount, and at minimum it provides you the additional test data of whether it worked or not, plus if it does, a chance to access the data and make sure your backups are current, before actually trying to do any repairs. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: BTRFS did it's job nicely (thanks!)
waxhead posted on Fri, 02 Nov 2018 20:54:40 +0100 as excerpted: > Note that I tend to interpret the btrfs de st / output as if the error > was NOT fixed even if (seems clearly that) it was, so I think the output > is a bit misleading... just saying... See the btrfs-device manpage, stats subcommand, -z|--reset option, and device stats section: -z|--reset Print the stats and reset the values to zero afterwards. DEVICE STATS The device stats keep persistent record of several error classes related to doing IO. The current values are printed at mount time and updated during filesystem lifetime or from a scrub run. So stats keeps a count of historic errors and is only reset when you specifically reset it, *NOT* when the error is fixed. (There's actually a recent patch, I believe in the current dev kernel 4.20/5.0, that will reset a device's stats automatically for the btrfs replace case when it's actually a different device afterward anyway. Apparently, it doesn't even do /that/ automatically yet. Keep that in mind if you replace that device.) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
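[A sketch of the stats workflow described above, with a hypothetical mount point:]

```shell
# Print the persistent per-device error counters for the filesystem:
btrfs device stats /mnt
# Print them AND reset them to zero afterwards -- e.g. after fixing
# the underlying problem or replacing a device, so old historic
# errors aren't mistaken for new ones:
btrfs device stats -z /mnt
```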
Re: Understanding BTRFS RAID0 Performance
The newer default is 16 KiB, while the old default was the (minimum for amd64/x86) 4 KiB, and the maximum is 64 KiB. See the mkfs.btrfs manpage for the details as there's a tradeoff, smaller sizes increase (metadata) fragmentation but decrease lock contention, while larger sizes pack more efficiently and are less fragmented but updating is more expensive. The change in default was because 16 KiB was a win over the old 4 KiB for most use-cases, but the 32 or 64 KiB options may or may not be, depending on use-case, and of course if you're bottlenecking on locks, 4 KiB may still be a win. Among all those, I'd be especially interested in what thread_pool=n does or doesn't do for you, both because it specifically mentions parallelization and because I've seen little discussion of it. space_cache=v2 may also be a big boost for you, if your filesystems are the size the 6-device raid0 implies and are at all reasonably populated. (Metadata) nodesize may or may not make a difference, tho I suspect if so it'll be mostly on writes (but I'm not familiar with the specifics there so could be wrong). I'd be interested to see if it does. In general I can recommend the no_holes and skinny_metadata features but you may well already have them, and the noatime mount option, which you may well already be using as well. Similarly, I ensure that all my btrfs are mounted from first mount with autodefrag, so it's always on as the filesystem is populated, but I doubt you'll see a difference from that in your benchmarks unless you're specifically testing an aged filesystem that would be heavily fragmented on its own. There's one guy here who has done heavy testing on the ssd stuff and knows btrfs on-device chunk allocation strategies very well, having come up with a utilization visualization utility and been the force behind the relatively recent (4.16-ish) changes to the ssd mount option's allocation strategy. 
He'd be the one to talk to if you're considering diving into btrfs' on-disk allocation code, etc. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Understanding BTRFS RAID0 Performance
Wilson, Ellis posted on Thu, 04 Oct 2018 21:33:29 + as excerpted: > Hi all, > > I'm attempting to understand a roughly 30% degradation in BTRFS RAID0 > for large read I/Os across six disks compared with ext4 atop mdadm > RAID0. > > Specifically, I achieve performance parity with BTRFS in terms of > single-threaded write and read, and multi-threaded write, but poor > performance for multi-threaded read. The relative discrepancy appears > to grow as one adds disks. [...] > Before I dive into the BTRFS source or try tracing in a different way, I > wanted to see if this was a well-known artifact of BTRFS RAID0 and, even > better, if there's any tunables available for RAID0 in BTRFS I could > play with. The man page for mkfs.btrfs and btrfstune in the tuning > regard seemed...sparse. This is indeed well known for btrfs at this point, as it hasn't been multi-read-thread optimized yet. I'm personally more familiar with the raid1 case, where which one of the two copies gets the read is simply even/odd-PID-based, but AFAIK raid0 isn't particularly optimized either. The recommended workaround is (as you might expect) btrfs on top of mdraid. In fact, while it doesn't apply to your case, btrfs raid1 on top of mdraid0s is often recommended as an alternative to btrfs raid10, as that gives you the best of both worlds -- the data and metadata integrity protection of btrfs checksums and fallback (with writeback of the correct version) to the other copy if the first copy read fails checksum verification, with the much better optimized mdraid0 performance. So it stands to reason that the same recommendation would apply to raid0 -- just do single-mode btrfs on mdraid0, for better performance than the as yet unoptimized btrfs raid0. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
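[A sketch of the btrfs-on-mdraid workaround recommended above -- all device names are hypothetical placeholders for the six disks in question:]

```shell
# Build a six-disk mdraid0 array (substitute your own devices):
mdadm --create /dev/md0 --level=0 --raid-devices=6 /dev/sd[b-g]
# Then put a plain single-device btrfs on top; mdraid handles the
# striping, btrfs still provides checksumming:
mkfs.btrfs /dev/md0

# For the raid1-over-mdraid0 variant also mentioned above, you'd
# build two mdraid0 arrays and mirror btrfs across them:
#   mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
```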
Re: Transaction aborted error -28 clone_finish_inode_update
David Goodwin posted on Thu, 04 Oct 2018 17:44:46 +0100 as excerpted: > While trying to run/use bedup ( https://github.com/g2p/bedup ) I > hit this : > > > [Thu Oct 4 15:34:51 2018] [ cut here ] > [Thu Oct 4 15:34:51 2018] BTRFS: Transaction aborted (error -28) > [Thu Oct 4 15:34:51 2018] WARNING: CPU: 0 PID: 28832 at > fs/btrfs/ioctl.c:3671 clone_finish_inode_update+0xf3/0x140 > [Thu Oct 4 15:34:51 2018] CPU: 0 PID: 28832 Comm: bedup Not tainted > 4.18.10-psi-dg1 #1 [snipping a bunch of stuff that I as a non-dev list regular can't do much with anyway] > [Thu Oct 4 15:34:51 2018] BTRFS: error (device xvdg) in > clone_finish_inode_update:3671: errno=-28 No space left > [Thu Oct 4 15:34:51 2018] BTRFS info (device xvdg): forced readonly > % btrfs fi us /filesystem/ > Overall: > Device size: 7.12TiB > Device allocated: 6.80TiB > Device unallocated: 330.93GiB > Device missing: 0.00B > Used: 6.51TiB > Free (estimated): 629.87GiB (min: 629.87GiB) > Data ratio: 1.00 > Metadata ratio: 1.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data+Metadata,single: Size:6.80TiB, Used:6.51TiB > /dev/xvdf 1.69TiB > /dev/xvdg 3.12TiB > /dev/xvdi 1.99TiB > > System,single: Size:32.00MiB, Used:780.00KiB > /dev/xvdf 32.00MiB > > Unallocated: > /dev/xvdf 320.97GiB > /dev/xvdg 949.00MiB > /dev/xvdi 9.03GiB > > > I kind of think there is sufficient free space. at least globally > within the filesystem. > > Does it require balancing to redistribute the unallocated space better? > Or is something misbehaving? The latter, but unfortunately there's not much you can do about it at this point but wait for fixes, unless you want to split up that huge filesystem into several smaller ones. In general, btrfs has at least four kinds of "space" that it can run out of, tho in your case it appears you're running mixed-mode so data and metadata space are combined into one. * Unallocated space: This is space that remains entirely unallocated in the filesystem. 
It matters most when the balance between data and metadata space gets off. This isn't a problem for you as in single mode space can be allocated from any device and you have one with hundreds of gigs unallocated. It also tends to be less of a problem on mixed-bg mode, which you're running, as there's no distinction in mixed-mode between data and metadata. * Data chunk space: * Metadata chunk space: Because you're running mixed-bg mode, there's no distinction between these two, but for normal mode, running out of one or the other while all the free space is allocated to chunks of the other type, can be a problem. * Global reserve: Taken from metadata, the global reserve is space the system won't normally use, that it tries to keep clear in order to be able to finish transactions once they're started, as btrfs' copy-on-write semantics means even deleting stuff requires a bit of additional space temporarily. This seems to actually be where the problem is, because currently, certain btrfs operations such as reflinking/cloning/snapshotting (that is, just what you were doing) don't really calculate the needed space correctly and use arbitrary figures, which can be *wildly* off, while conversely a bare half-gig of global-reserve for a huge 7+ TiB filesystem seems rather proportionally small. (Consider that my small pair-device btrfs raid1 root filesystem, 8-GiB/device, 16 GiB total, has a 16 MiB reserve; proportionally, your 7+ TiB filesystem would have a 7+ GiB reserve, but it only has a half GiB.) So relatively small btrfs' don't tend to run into the problem, because they have proportionally larger reserves to begin with. Plus they probably don't have proportionally as many snapshots/reflinks/etc, either, so the problem simply doesn't trigger for them. 
Now I'm not a dev and my own use-case doesn't include either snapshotting or deduping, so I haven't paid that much attention to the specifics, but I have seen some recent patches on-list that based on the explanations should go some way toward fixing this problem by using more realistic figures for global-reserve calculations. At this point those patches would be for 4.20 (which might be 5.0), or possibly 4.21, but the devs are indeed working on the problem and it should get better within a couple kernel cycles. Alternatively perhaps the global reserve size could be bumped up on such large filesystems, but let's see if the more realistic operations-reserve calculations can fix things, first, as arguably that shouldn't be necessary once the calculations aren't so arbitrarily wild. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: What to do with damaged root fllesystem (opensuse leap 42.2)
...and the greatest chance at fixing things or, for restore, scraping files off the damaged filesystem. So before doing the btrfs restore, you should find a current btrfs-progs, 4.17.1 ATM, to do it with, as that should give you the best results. Try Fedora Rawhide or Arch (or the Gentoo I run), as they tend to have more current versions. Then you need some place to put the scraped files, a writable filesystem with enough space to put what you're trying to restore. Once you have some place to put the scraped files, with luck, it's a simple case of running btrfs restore <options> <device> <path>, where <device> is the damaged filesystem, <path> is the path on the writable filesystem where you want to dump the restored files, and <options> can include various options as found in the btrfs-restore manpage, like -m/--metadata if you want to try to restore owner/times/perms for the files, -s/--symlinks if you want to try to restore them, -x/--xattr if you want to try to restore them, etc. You may want to do a dry-run with -D/--dry-run first, to get some idea of whether it's looking like it can restore many of the files or not, and thus, of the sort of free space you may need on the writable filesystem to store the files it can restore. If a simple btrfs restore doesn't seem to get anything, there is an advanced mode as well, with a link to the wiki page covering it in the btrfs-restore manpage, but it does get quite technical, and results may vary. You will likely need help with that if you decide to try it, but as they say, that's a bridge we can cross when/if we get to it, no need to deal with it just yet. Meanwhile, again, don't worry too much about whether you can recover anything here or not, because in any case you already have what was most important to you, either backups you can restore from if you considered the data worth having them, or the time and trouble you would have put into those backups, if you considered saving that more important than making the backups. 
So losing the data on the filesystem, whether from filesystem error as seems to be the case here, due to admin fat-fingering (the infamous rm -rf .* or alike), or due to physical device loss if the disks/ssds themselves went bad, can never be a big deal, because the maximum value of the data in question is always strictly limited to that of the point at which having a backup is more important than the time/trouble/resources you save(d) by not having one. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
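[Pulling the restore discussion above together into one hypothetical session -- the device and target paths below are placeholders:]

```shell
# Dry run first, to gauge what's recoverable and roughly how much
# space the writable target filesystem will need:
btrfs restore -D /dev/sdb2 /mnt/rescue
# Then the real run, also attempting owner/times/perms (-m),
# symlinks (-s) and extended attributes (-x):
btrfs restore -m -s -x /dev/sdb2 /mnt/rescue
```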
Re: btrfs problems
Adrian Bastholm posted on Thu, 20 Sep 2018 23:35:57 +0200 as excerpted: > Thanks a lot for the detailed explanation. > About "stable hardware/no lying hardware". I'm not running any raid > hardware, was planning on just software raid. three drives glued > together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would > this be a safer bet, or would you recommend running the sausage method > instead, with "-d single" for safety? I'm guessing that if one of the > drives dies the data is completely lost. Another variant I was > considering is running a raid1 mirror on two of the drives and maybe a > subvolume on the third, for less important stuff. Agreed with CMurphy's reply, but he didn't mention... As I wrote elsewhere recently, don't remember if it was in a reply to you before you tried zfs and came back, or to someone else, so I'll repeat here, briefer this time... Keep in mind that on btrfs, it's possible (and indeed the default with multiple devices) to run data and metadata at different raid levels. IMO, as long as you're following an appropriate backup policy that backs up anything valuable enough to be worth the time/trouble/resources of doing so, so if you /do/ lose the array you still have a backup of anything you considered valuable enough to worry about (and that caveat is always the case, no matter where or how it's stored, value of data is in practice defined not by arbitrary claims but by the number of backups it's considered worth having of it)... With that backups caveat, I'm now confident /enough/ about raid56 mode to be comfortable cautiously recommending it for data, tho I'd still /not/ recommend it for metadata, which I'd recommend should remain the multi-device default raid1 level. 
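[A sketch of that split-profile layout, using the poster's three drives -- device names and mount point are the hypothetical ones from the question:]

```shell
# raid5 for data, raid1 for metadata, across the three drives:
mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd
# An existing multi-device filesystem can be converted in place
# using balance convert filters instead:
btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt
```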
That way, you're only risking a limited amount of raid5 data to the not yet as mature and well tested raid56 mode, the metadata remains protected by the more mature raid1 mode, and if something does go wrong, it's much more likely to be only a few files lost instead of the entire filesystem, as is at risk if your metadata is raid56 as well, the metadata including checksums will be intact so scrub should tell you what files are bad, and if those few files are valuable they'll be on the backup and easy enough to restore, compared to restoring the entire filesystem. But for most use-cases, metadata should be relatively small compared to data, so duplicating metadata as raid1 while doing raid5 for data should go much easier on the capacity needs than raid1 for both would. Tho I'd still recommend raid1 data as well for higher maturity and tested ability to use the good copy to rewrite the bad one if one copy goes bad (in theory, raid56 mode can use parity to rewrite as well, but that's not yet as well tested and there's still the narrow degraded-mode crash write hole to worry about), if it's not cost-prohibitive for the amount of data you need to store. But for people on a really tight budget or who are storing double-digit TB of data or more, I can understand why they prefer raid5, and I do think raid5 is stable enough for data now, as long as the metadata remains raid1, AND they're actually executing on a good backup policy. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: [RFC PATCH v2 0/4] btrfs-progs: build distinct binaries for specific btrfs subcommands
Axel Burri posted on Fri, 21 Sep 2018 11:46:37 +0200 as excerpted: > I think you got me wrong here: There will not be binaries with the same > filename. I totally agree that this would be a bad thing, no matter if > you have bin/sbin merged or not, you'll end up in either having a > collision or (even worse) rely on the order in $PATH. > > With this "separated" patchset, you can install a binary > "btrfs-subvolume-show", which has the same functionality as "btrfs > subvolume show" (note the whitespace/dash), ending up with: > > /sbin/btrfs > /usr/bin/btrfs-subvolume-show > /usr/bin/btrfs-subvolume-list I did get you wrong (and had even understood the separately named binaries from an earlier post, too, but forgot). Thanks. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: [RFC PATCH v2 0/4] btrfs-progs: build distinct binaries for specific btrfs subcommands
Axel Burri posted on Thu, 20 Sep 2018 00:02:22 +0200 as excerpted: > Now not everybody wants to install these with fscaps or setuid, but it > might also make sense to provide "/usr/bin/btrfs-subvolume-{show,list}", > as they now work for a regular user. Having both root/user binaries > concurrently is not an issue (e.g. in gentoo the full-featured btrfs > command is in "/sbin/"). That's going to be a problem for distros (or users like me with advanced layouts, on gentoo too FWIW) that have the bin/sbin merge, where one is a symlink to the other. FWIW I have both the /usr merge (tho reversed for me, so /usr -> . instead of having to have /bin and /sbin symlinks to /usr/bin) and the bin/sbin merge, along with, since I'm on amd64-nomultilib, the lib/lib64 merge. So: $$ dir -gGd /bin /sbin /usr /lib /lib64 drwxr-xr-x 1 35688 Sep 18 22:56 /bin lrwxrwxrwx 1 5 Aug 7 00:29 /lib -> lib64 drwxr-xr-x 1 78560 Sep 18 22:56 /lib64 lrwxrwxrwx 1 3 Mar 11 2018 /sbin -> bin lrwxrwxrwx 1 1 Mar 11 2018 /usr -> . Of course that last one (/usr -> .) leads to /share and /include hanging directly off of / as well, but it works. But in that scheme /bin, /sbin, /usr/bin and /usr/sbin, are all the same dir, so only one executable of a particular name can exist therein. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
Tomasz Chmielewski posted on Wed, 19 Sep 2018 10:43:18 +0200 as excerpted: > I have a mysql slave which writes to a RAID-1 btrfs filesystem (with > 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full. > > The slave receives around 0.5-1 MB/s of data from the master over the > network, which is then saved to MySQL's relay log and executed. In ideal > conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s > of data written to disk. > > MySQL directory and files in it are chattr +C (since the directory was > created, so all files are really +C); there are no snapshots. > > > Now, an interesting thing. > > When the filesystem is mounted with these options in fstab: > > defaults,noatime,discard > > > We can see a *constant* write of 25-100 MB/s to each disk. The system is > generally unresponsive and it sometimes takes long seconds for a simple > command executed in bash to return. > > > However, as soon as we remount the filesystem with space_cache=v2 - > writes drop to just around 3-10 MB/s to each disk. If we remount to > space_cache - lots of writes, system unresponsive. Again remount to > space_cache=v2 - low writes, system responsive. > > > That's a huuge, 10x overhead! Is it expected? Especially that > space_cache=v1 is still the default mount option? The other replies are good but I've not seen this pointed out yet... Perhaps you are accounting for this already, but you don't /say/ you are, while you do mention repeatedly toggling the space-cache options, which would trigger it so you /need/ to account for it... I'm not sure about space_cache=v2 (it's probably more efficient with it even if it does have to do it), but I'm quite sure that space_cache=v1 takes some time after initial mount with it to scan the filesystem and actually create the map of available free space that is the space_cache. 
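[For reference, the toggling the poster describes is done with remounts along these lines (mount point hypothetical) -- and after each toggle, the filesystem should be given time to rebuild the cache and quiesce before measuring:]

```shell
# Switch to the v2 space cache (the free-space tree); per the poster,
# this took effect on remount:
mount -o remount,space_cache=v2 /srv/mysql
# Switch back to the v1 space cache:
mount -o remount,space_cache /srv/mysql
```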
Now you said ssds, which should be reasonably fast, but you also say 3-device btrfs raid1, with each device ~2TB, and the filesystem ~40% full, which should be ~2 TB of data, which is likely somewhat fragmented so it's likely rather more than 2 TB of data chunks to scan for free space, and that's going to take /some/ time even on SSDs! So if you're toggling settings like that in your tests, be sure to let the filesystem rebuild the cache you just toggled and give it time to complete that and quiesce, before you start trying to measure write amplification. Otherwise it's not write amplification you're measuring, but the churn from the filesystem still trying to rebuild its cache after you toggled it! Also, while 4.17 is well after the ssd mount option (usually auto-detected, check /proc/mounts, mount output, or dmesg, to see if the ssd mount option is being added) fixes that went in in 4.14, if the filesystem has been in use for several kernel cycles and in particular before 4.14, with the ssd mount option active, and you've not rebalanced since then, you may well still have serious space fragmentation from that, which could increase the amount of data in the space_cache map rather drastically, thus increasing the time it takes to update the space_cache, particularly v1, after toggling it on. A balance can help correct that, but it might well be easier, and should result in a better layout, to simply blow the filesystem away with a mkfs.btrfs and start over. Meanwhile, as Remi already mentioned, you might want to reconsider nocow on btrfs raid1, since nocow defeats checksumming and thus scrub, which verifies checksums, simply skips it, and if the two copies get out of sync for some reason... -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
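To make the toggling less error-prone, it helps to read back which cache version is actually active after each remount. A small helper for parsing the options field from /proc/mounts (the mountpoint and options strings below are examples, not from the original report):

```shell
# Report which free-space cache a btrfs mount is using, given its
# options string, e.g. obtained on a real system with something like:
#   opts=$(awk '$2 == "/mnt/data" {print $4}' /proc/mounts)
cache_version() {
    case ",$1," in
        *,space_cache=v2,*)                 echo v2 ;;
        *,space_cache=v1,*|*,space_cache,*) echo v1 ;;
        *,nospace_cache,*)                  echo none ;;
        *)                                  echo default ;;
    esac
}
cache_version "rw,noatime,ssd,space_cache=v2,subvolid=5,subvol=/"   # → v2
```

Then, after `mount -o remount,space_cache=v2 /mnt/data` (or back to v1), you can confirm the switch took effect and wait for writes to quiesce before measuring anything.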
Re: GRUB writing to grubenv outside of kernel fs code
Chris Murphy posted on Tue, 18 Sep 2018 13:34:14 -0600 as excerpted: > I've run into some issue where grub2-mkconfig and grubby, can change the > grub.cfg, and then do a really fast reboot without cleanly unmounting > the volume - and what happens? Can't boot. The bootloader can't do log > replay so it doesn't see the new grub.cfg at all. If all you do is mount > the volume and unmount, log replay happens, the fs metadata is all fixed > up just fine, and now the bootloader can see it. > This same problem can happen with the kernel and initramfs > installations. > > (Hilariously the reason why this can happen is because of a process > exempting itself from being forcibly killed by systemd *against* the > documented advice of systemd devs that you should only do this for > processes not on rootfs; but as a consequence of this process doing the > wrong thing, systemd at reboot time ends up doing an unclean unmount and > reboot because it won't kill the kill exempt process.) That's... interesting! FWIW here I use grub2, but like many admins, I'm quite comfortable with bash, and the high-level grub2 config mechanisms simply didn't let me do what I needed to do. So I had to learn the lower-level grub bash-like scripting language to do what I wanted, and I even go so far as to install-mask some of the higher level stuff so it doesn't get installed at all, and thus can't somehow run and screw up my config. So I edit my grub scripts (and grubenv) much like I'd edit any other system script (and its separate config file where I have them) I might need to update, then save my work, and with both a bios-boot partition setup for grub-core and an entirely separate /boot that's not routinely mounted unless I'm updating it, I normally unmount it when I'm done, before I actually reboot. So I've never had systemd interfere. (And of course I have backups. 
In fact, on my main personal system, with both the working root and its primary backup being btrfs pair-device raid1 on separate devices, I have four physical ssds installed, with a bios-boot partition with grub installed and a separate dedicated (btrfs dup mode) /boot on each of all four, so I have a working grub and /boot and three backups, each of which I can point the bios at and have tested separately as bootable. So if upgrading grub or anything on /boot goes wrong I find that out testing the working copy, and boot one of the backups to resolve the problem before eventually upgrading all three backups after the working copy upgrade is well tested.) > So *already* we have file systems that are becoming too complicated for > the bootloader to reliably read, because they cannot do journal relay, > let alone have any chance of modifying (nor would I want them to do > this). So yeah I'm, very rapidly becoming opposed to grubenv on anything > but super simple volumes like maybe ext4 without a journal (extents are > nice); or even perhaps GRUB should just implement its own damn file > system and we give it its own partition - similar to BIOS Boot - but > probably a little bigger You realize that solution is already standardized as EFI and its standard FAT filesystem, right? =:^) >>> but is the bootloader overwrite of gruvenv going to recompute parity >>> and write to multiple devices? Eek! >> >> Recompute the parity should not be a big deal. Updating all the >> (b)trees would be a too complex goal. > > I think it's just asking for trouble. Sometimes the best answer ends up > being no, no and definitely no. Agreed. 
I actually /like/ the fact that at the grub prompt I can rely on everything being read-only, and if that SuSE patch to put grubenv in the reserved space and make it writable gets upstreamed, I really hope there's a build-time configure option to disable the feature, because IMO grub doesn't /need/ to save state at that point, and allowing it to do so is effectively needlessly playing a risky Russian Roulette game with my storage devices. Were it actually needed that'd be different, but it's not needed, so any risk is too much risk. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: btrfs panic problem
ountable filesystem. So for routine operation, it's no big deal if userspace is a bit old, at least as long as it's new enough to have all the newer command formats, etc, that you need, and for comparing against others when posted. But once things go bad on you, you really want the newest btrfs-progs in order to give you the best chance at either fixing things, or worst-case, at least retrieving the files off the dead filesystem. So using the older distro btrfs-progs for routine running should be fine, but unless your backups are complete and frequent enough that if something goes wrong it's easiest to simply blow the bad version away with a fresh mkfs and start over, you'll probably want at least a reasonably current btrfs-progs on your rescue media. Since the userspace version numbers are synced to the kernel cycle, a good rule of thumb is to keep your btrfs-progs version to at least that of the oldest recommended LTS kernel version, as well, so you'd want at least btrfs-progs 4.9 on your rescue media, for now, and 4.14, coming up, since when the new kernel goes LTS that'll displace 4.9 and 4.14 will then be the second-back LTS. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
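The rule of thumb above boils down to "is the progs version at least the oldest recommended LTS"; `sort -V` can do that comparison in a pinch. A small sketch (the helper name is made up, and the hardcoded version stands in for real `btrfs --version` output):

```shell
# True if version $1 is >= version $2, using version-number sort order.
ver_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# On a real system, something like:
#   progs_ver=$(btrfs --version | grep -o '[0-9][0-9.]*$')
progs_ver=4.9
if ver_ge "$progs_ver" 4.9; then
    echo "rescue-media btrfs-progs $progs_ver meets the 4.9 LTS floor"
else
    echo "consider updating btrfs-progs on the rescue media"
fi
```

The same check against 4.14 once that kernel goes LTS, per the post.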
Re: state of btrfs snapshot limitations?
early, all at (nearly) the same time, and then simply deleting all in the appropriate directory beyond some cap time, instead of the thinning logic of the above traditional model, wouldn't actually be much less efficient in terms of snapshot taking, because snapshotting is /designed/ to be fast, while at the same time it would significantly simplify the logic of the deletion scripts since they could simply delete everything older than X, instead of having to do conditional thinning logic. So your scheme, with period slotting and capping as opposed to simply timestamping and thinning, is a new thought to me, but I like the idea for its simplicity, and as I said, it shouldn't really "cost" more, because taking snapshots is fast and relatively cost-free. =:^) I'd still recommend taking it easy on the yearly, tho, perhaps beyond a year or two, preferring physical media swapping and archiving at the yearly level if yearly archiving is found necessary at all. And depending on your particular needs, physical-swap archiving at six months or even quarterly might actually be appropriate, especially given that (with spinning rust at least, I guess ssds retain best with periodic power-up) on-the-shelf archiving should be more dependable as a last-resort backup. Or do similar online with, for example, Amazon Glacier (never used personally, tho I actually have the site open for reference as I write this and at US $0.004 per gig per month... so say $100 for a TB for 2 years or a couple hundred gig for a decade, $10/yr with a much better chance at actually being able to use it after a fire/flood/etc that'd take out anything local, tho actually retrieving it would cost a bit too... I'm actually thinking perhaps I should consider it... obviously I'd well encrypt first... until now I'd always done onsite backup only, figuring if I had a fire or something that'd be the last thing I'd be worried about, but now I'm actually considering...) 
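The "simply delete everything older than X" scheme can be sketched in a few lines, assuming date-named snapshots in a per-period directory (the layout below is hypothetical, GNU date is assumed, and the echo makes it a dry run; real deletion would be `btrfs subvolume delete`):

```shell
# Demo of cap-style pruning: build a throwaway snapshot dir, then
# "delete" (echo only) everything older than the cutoff.  YYYY-MM-DD
# names sort the same way lexically and chronologically.
snapdir=$(mktemp -d)
mkdir "$snapdir/2017-01-01" "$snapdir/2017-06-15" "$snapdir/$(date +%F)"
cutoff=$(date -d '365 days ago' +%F)         # GNU date
for snap in "$snapdir"/*; do
    if expr "$(basename "$snap")" \< "$cutoff" >/dev/null; then
        echo btrfs subvolume delete "$snap"  # drop the echo to really delete
    fi
done
```

No conditional thinning logic anywhere: one comparison per snapshot, which is the simplicity being praised above.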
OK, so I guess the bottom-line answer is "it depends." But the above should give you more data to plug in for your specific use-case. But if it's pure backup, you don't expect to expand to more devices in-place and you can blow it away and don't have to consider check --repair, AND you can do a couple filesystems so as to keep your daily snapshots separate from the more frequent backups and thus avoid snapshot deletion, you may actually be able to do the 365 dailies for 2-3 years then swap out filesystems and devices without deleting snapshots, thus avoiding any of the maintenance-scaling issues that are the big limitation, and have it work just fine. OTOH, if your use-case is a bit more conventional, with more maintenance to have to worry about scaling, capping to 100 snapshots remains a reasonable recommendation, and if you need quotas as well and can't afford to disable them even temporarily for a balance, you may find under 50 snapshots to be your maintenance pain tolerance threshold. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: List of known BTRFS Raid 5/6 Bugs?
often aren't even aware of the tradeoffs they're taking on those solutions, so... I suppose when it's all said and done the only people aware of the issues on btrfs are likely going to be the highly technical and case-optimizer crowds, too. Everyone else will probably just use the defaults and not even be aware of the tradeoffs they're making by doing so, as is already the case on mdraid and zfs. --- [1] As I'm no longer running either mdraid or parity-raid, I've not followed this extremely closely, but writing this actually spurred me to google the problem and see when and how mdraid fixed it. So the links are from that. =:^) [2] Journalling/journaling, one or two Ls? The spellcheck flags both and last I tried googling it the answer was inconclusive. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: List of known BTRFS Raid 5/6 Bugs?
's merged as well. Don't just jump on it immediately after merge unless you're deliberately doing so to help test for bugs and get them fixed and the feature stabilized as soon as possible. Wait a few kernel cycles, follow the list to see how the feature's stability is coming, and /then/ use it, after factoring in its remaining, then still new and less mature, additional risk into your backup risk profile, of course. Time? Not a dev but following the list and obviously following the new 3-way-mirroring, I'd say probably not 4.20 (5.0?) for the new mirroring modes, so 4.21/5.1 more reasonably likely (if all goes well, could be longer), probably another couple cycles (if all goes well) after that for the parity-raid logging code built on top of the new mirroring modes, so perhaps a year (~5 kernel cycles) to introduction for it. Then wait however many cycles until you think it has stabilized. Call that another year. So say about 10 kernel cycles or two years. It could be a bit less than that, say 5-7 cycles, if things go well and you take it before I'd really consider it stable enough to recommend, but given the historically much longer than predicted development and stabilization times for raid56 already, it could just as easily end up double that, 4-5 years out, too. But raid56 logging mode for write-hole mitigation is indeed actively being worked on right now. That's what we know at this time. And even before that, right now, raid56 mode should already be reasonably usable, especially if you do data raid5/6 and metadata raid1, as long as your backup policy and practice is equally reasonable. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Re-mounting removable btrfs on different device
Remi Gauvin posted on Thu, 06 Sep 2018 20:54:17 -0400 as excerpted: > I'm trying to use a BTRFS filesystem on a removable drive. > > The first drive drive was added to the system, it was /dev/sdb > > Files were added and device unmounted without error. > > But when I re-attach the drive, it becomes /dev/sdg (kernel is fussy > about re-using /dev/sdb). > > btrfs fi show: output: > > Label: 'Archive 01' uuid: 221222e7-70e7-4d67-9aca-42eb134e2041 > Total devices 1 FS bytes used 515.40GiB > devid1 size 931.51GiB used 522.02GiB path /dev/sdg1 > > This causes BTRFS to fail mounting the device [errors snipped] > I've seen some patches on this list to add a btrfs device forget option, > which I presume would help with a situation like this. Is there a way > to do that manually? Without the mentioned patches, the only way (other than reboot) is to remove and reinsert the btrfs kernel module (assuming it's a module, not built-in), thus forcing it to forget state. Of course if other critical mounted filesystems (such as root) are btrfs, or if btrfs is a kernel-built-in not a module and thus can't be removed, the above doesn't work and a reboot is necessary. Thus the need for those patches you mentioned. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: IO errors when building RAID1.... ?
Chris Murphy posted on Fri, 31 Aug 2018 13:02:16 -0600 as excerpted: > If you want you can post the output from 'sudo smartctl -x /dev/sda' > which will contain more information... but this is in some sense > superfluous. The problem is very clearly a bad drive, the drive > explicitly report to libata a write error, and included the sector LBA > affected, and only the drive firmware would know that. It's not likely a > cable problem or something like. And that the write error is reported at > all means it's persistent, not transient. Two points: 1) Does this happen to be an archive/SMR (shingled magnetic recording) device? If so that might be the problem as such devices really aren't suited to normal usage (they really are designed for archiving), and btrfs' COW patterns can exacerbate the issue. It's quite possible that the original install didn't load up the IO as heavily as the balance-convert does, so the problem appears with convert but not for install. 2) Assuming it's /not/ an SMR issue, and smartctl doesn't say it's dying, I'd suggest running badblocks -w (make sure the device doesn't have anything valuable on it!) on the device -- note that this will take a while, probably a couple of days, perhaps longer, as it writes four different patterns to the entire device one at a time, reading everything back to verify the pattern was written correctly, so it's actually going over the entire device 8 times, alternating write and read, but it should settle the issue of the reliability of the device. Or if you'd rather spend the money than the time and it's not still under warranty, just replace it, or at least buy a new one to use while you run the tests on that one. I fully understand that tying up the thing running tests on it for days straight may not be viable. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
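A hedged sketch of point 2 (the device name is a placeholder; the -w test is DESTRUCTIVE, so only run it on a drive with nothing valuable on it):

```shell
smartctl -x /dev/sdX > smart-before.txt   # baseline SMART attributes
badblocks -wsv /dev/sdX                   # 4 write+read passes over the whole device
smartctl -x /dev/sdX > smart-after.txt
diff smart-before.txt smart-after.txt     # watch for reallocated/pending sector growth
```

Comparing the before/after SMART dumps is what tells you whether the drive quietly remapped sectors during the test even if badblocks itself reports none bad.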
Re: How to erase a RAID1 (+++)?
Alberto Bursi posted on Fri, 31 Aug 2018 14:54:46 + as excerpted: > I just keep around a USB drive with a full Linux system on it, to act as > "recovery". If the btrfs raid fails I boot into that and I can do > maintenance with a full graphical interface and internet access so I can > google things. I do very similar, except my "recovery boot" is my backup (normally with two levels of backup/recovery available for root, three for some things). I've actually gone so far as to have /etc/fstab be a symlink to one of several files, depending on what version of root vs. the off-root filesystems I'm booting, with a set of modular files that get assembled by scripts to build the fstabs as appropriate. So updating fstab is a process of updating the modules, then running the scripts to create the actual fstabs, and after I update a root backup the last step is changing the symlink to point to the appropriate fstab for that backup, so it's correct if I end up booting from it. Meanwhile, each root, working and two backups, is its own set of two device partitions in btrfs raid1 mode. (One set of backups is on separate physical devices, covering the device death scenario, the other is on different partitions on the same, newer and larger pair of physical devices as the working set, so it won't cover device death but still covers fat-fingering, filesystem fubaring, bad upgrades, etc.) /boot is separate and there's four of those (working and three backups), one each on each device of the two physical pairs, with the bios able to point to any of the four. I run grub2, so once the bios loads that, I can interactively load kernels from any of the other three /boots and choose to boot any of the three roots. And I build my own kernels, with an initrd attached as an initramfs to each, and test that they boot. So selecting a kernel by definition selects its attached initramfs as well, meaning the initr*s are backed up and selected with the kernels. 
(As I said earlier it'd sure be nice to be able to do away with the initr*s again. I was actually thinking about testing that today, which was supposed to be a day off, but got called in to work, so the test will have to wait once again...) What's nice about all that is that just as you said, each recovery/backup is a snapshot of the working system at the time I took the backup, so it's not a limited recovery boot at all, it has the same access to tools, manpages, net, X/plasma, browsers, etc, that my normal system does, because it /is/ my normal system from whenever I took the backup. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
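The modular-fstab trick described above can be sketched in a few lines (the fragment names and entries are made up for illustration, and the symlink stands in for /etc/fstab):

```shell
# Assemble per-root fstabs from shared + root-specific fragments, then
# point a symlink at whichever one matches the root being booted.
cd "$(mktemp -d)"
cat > fstab.common <<'EOF'
LABEL=home  /home  btrfs  noatime  0 0
EOF
cat > fstab.root.working <<'EOF'
LABEL=root-work  /  btrfs  noatime  0 0
EOF
cat fstab.root.working fstab.common > fstab.working
ln -sf fstab.working fstab    # on a real system: /etc/fstab -> fstab.working
cat fstab
```

After updating a backup root, regenerating the fragments and flipping the symlink (to a hypothetical fstab.backup1, say) is the whole maintenance step.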
Re: How to erase a RAID1 (+++)?
purposes, go right ahead! That's what btrfs raid1 is for, after all. But if you were planning on mounting degraded (semi-)routinely, please do reconsider, because it's just not ready for that at this point, and you're going to run into all sorts of problems trying to do it on an ongoing basis due to the above issues. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: DRDY errors are not consistent with scrub results
Cerem Cem ASLAN posted on Wed, 29 Aug 2018 09:58:21 +0300 as excerpted: > Thinking again, this is totally acceptable. If the requirement was a > good health disk, then I think I must check the disk health by myself. > I may believe that the disk is in a good state, or make a quick test or > make some very detailed tests to be sure. For testing you might try badblocks. It's most useful on a device that doesn't have a filesystem on it you're trying to save, so you can use the -w write-test option. See the manpage for details. The -w option should force the device to remap bad blocks where it can as well, and you can take your previous smartctl read and compare it to a new one after the test. Hint if testing multiple spinning-rust devices: Try running multiple tests at once. While this might have been slower on old EIDE, at least with spinning rust, on SATA and similar you should be able to test multiple devices at once without them slowing down significantly, because the bottleneck is the spinning rust, not the bus, controller or CPU. I used badblocks years ago to test my new disks before setting up mdraid on them, and with full disk tests on spinning rust taking (at the time) nearly a day a pass and four passes for the -w test, the multiple tests at once trick saved me quite a bit of time! It's not a great idea to do the test on new SSDs as it's unnecessary wear, writing the entire device four times with different patterns each time for a -w, but it might be worthwhile to try it on an ssd you're just trying to salvage, forcing it to swap out any bad sectors it encounters in the process. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
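The run-them-in-parallel trick is plain shell job control; here's a safe-to-run sketch with badblocks replaced by a sleep stub (device names are placeholders):

```shell
# One background job per device; `wait` blocks until all finish, so the
# total wall time is roughly one device's runtime, not the sum.
# On a real system test_dev would instead run: badblocks -wsv "/dev/$1"
test_dev() { sleep 1; echo "finished $1"; }
for dev in sda sdb sdc; do
    test_dev "$dev" &
done
wait
```

With the real badblocks in place of the stub, that's the pattern that saved nearly a day per additional spinning-rust device.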
Re: btrfs-convert missing in btrfs-tools v4.15.1
Nicholas D Steeves posted on Thu, 23 Aug 2018 14:15:18 -0400 as excerpted: >> It's in my interest to ship all tools in distros, but there's also only >> that much what the upstream community can do. If you're going to >> reconsider the status of btrfs-convert in Debian, please let me know. > > Yes, I'd be happy to advocate for its reinclusion if the answer to 4/5 > of the following questions is "yes". Does SUSE now recommend the use of > btrfs-convert to its enterprise customers? The following is a > frustrating criteria, but: Can a random desktop user run btrfs-convert > against their ext4 rootfs and expect the operation to succeed? Is > btrfs-convert now sufficiently trusted that it can be recommended with > the same degree of confidence as a backup, mkfs.btrfs, then restore to > new filesystem approach? Does the user of a btrfs volume created with > btrfs-convert have an equal or lesser probability of encountering bugs > compared to a one who used mkfs.btrfs? Just a user and list regular here, and gentoo not debian, but for what it's worth... I'd personally never consider or recommend a filesystem converter over the backup, mkfs-to-new-fs, restore-to-new-fs method, for three reasons. 1) Regardless of how stable a filesystem converter is and what two filesystems the conversion is between, "things" /do/ occasionally happen, thus making it irresponsible to use or recommend use of such a converter without a suitably current and tested backup, "just in case." (This is of course a special case of the sysadmin's first rule of backups, that the true value of data is defined not by any arbitrary claims, but by the number of backups of that data it's considered worth the time/trouble/resources to make/have. 
If the data value is trivial enough, sure, don't bother with the backup, but if it's of /that/ low a value, so low it's not worth a backup even when doing something as theoretically risky as a filesystem conversion, why is it worth the time and trouble to bother converting it in the first place, instead of just blowing it away and starting clean?) 2) Once a backup is considered "strongly recommended", as we've just established that it should be in 1 regardless of the stability of the converter, just using the existing filesystem as that backup and starting fresh with a mkfs for the new filesystem and copying things over is, simply put, the easiest, simplest and cleanest method to change filesystems. 3) (Pretty much)[1] Regardless of the filesystems in question, a fresh mkfs and clean sequential transfer of files from the old-fs/backup to the new one is pretty well guaranteed to be better optimized than conversion from an existing filesystem of a different type, particularly one that has been in normal operation for a while and thus has operational fragmentation of both data and free-space. That's in addition to being less bug-prone, even for a "stable" converter. Restating: so (1) doing a conversion without a backup is irresponsible, (2) the easiest backup and conversion method is directly using the old fs as the backup, and copying over to the freshly mkfs-ed new filesystem, and (3) a freshly mkfs-ed filesystem and sequential copy of files to it from backup, whether that be the old filesystem or not, is going to be more efficient and less bug-prone than an in-place conversion. Given the above, why would /anyone/ /sane/ consider using a converter? It simply doesn't make sense, even if the converter were as stable as the most stable filesystems we have. 
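Restated as commands (device names and mountpoints are hypothetical, and the rsync flags are one reasonable choice for preserving hardlinks/ACLs/xattrs, not gospel):

```shell
rsync -aHAX /mnt/oldfs/ /mnt/backup/   # 1) the old fs doubles as the backup
mkfs.btrfs -L newfs /dev/sdb1          # 2) fresh, optimally laid-out filesystem
mount /dev/sdb1 /mnt/newfs
rsync -aHAX /mnt/backup/ /mnt/newfs/   # 3) clean sequential restore
```

The old filesystem stays intact until the new one is verified, which is exactly the safety property an in-place converter cannot give you.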
So as a distro btrfs package maintainer, do what you wish in terms of the converter, but were it me, I might actually consider replacing it with an executable that simply printed out some form of the above argument, with a pointer to the sources should they still be interested after having read that argument.[2] Then, if people really are determined to unnecessarily waste their time to get a less efficient filesystem, possibly risking their data in the process of getting it, they can always build the converter from sources themselves. --- [1] I debated omitting the qualifier as I know of no exceptions, but I'm not a filesystem expert and while I'm a bit skeptical, I suppose it's possible that they might exist. [2] There's actually btrfs precedent for this in the form of the executable built as fsck.btrfs, which does nothing (successfully) but possibly print a message referring people to btrfs check, if run in interactive mode. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: lazytime mount option—no support in Btrfs
Austin S. Hemmelgarn posted on Wed, 22 Aug 2018 07:30:09 -0400 as excerpted: >> Meanwhile, since broken rootflags requiring an initr* came up let me >> take the opportunity to ask once again, does btrfs-raid1 root still >> require an initr*? It'd be /so/ nice to be able to supply the >> appropriate rootflags=device=...,device=... and actually have it work >> so I didn't need the initr* any longer! > Last I knew, specifying appropriate `device=` options in rootflags works > correctly without an initrd. Just to confirm, that's with multi-device btrfs rootfs? Because it used to work when the btrfs was single-device, but not multi-device. (For multi-device, or at least raid1, one had to add degraded, also, or it would refuse to mount despite all the appropriate device= entries in rootflags, thus of course risking all the problems running degraded raid1 operationally can bring, tho I never figured out for sure whether btrfs was smart enough to eventually pick up the other devices, after the scan before bringing other btrfs online or not, but either way it was a risk I wasn't willing to take.) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: lazytime mount option—no support in Btrfs
Austin S. Hemmelgarn posted on Tue, 21 Aug 2018 13:01:00 -0400 as excerpted: > Otherwise, the only option for people who want it set is to patch the > kernel to get noatime as the default (instead of relatime). I would > look at pushing such a patch upstream myself actually, if it weren't for > the fact that I'm fairly certain that it would be immediately NACK'ed by > at least Linus, and probably a couple of other people too. What about making default-noatime a kconfig option, presumably set to default-relatime by default? That seems to be the way many legacy-incompatible changes work. Then for most it's up to the distro, which in fact it is already, only if the distro set noatime-default they'd at least be using an upstream option instead of patching it themselves, making it upstream code that could be accounted for instead of downstream code that... who knows? Meanwhile, I'd be interested in seeing your local patch. I'm local-patching noatime-default here too, but not being a dev, I'm not entirely sure I'm doing it "correctly", tho AFAICT it does seem to work. 
FWIW, here's what I'm doing (posting inline so may be white-space damaged, and IIRC I just recently manually updated the line numbers so they don't reflect the code at the 2014 date any more, but as I'm not sure of the "correctness" it's not intended to be applied in any case):

--- fs/namespace.c.orig	2014-04-18 23:54:42.167666098 -0700
+++ fs/namespace.c	2014-04-19 00:19:08.622741946 -0700
@@ -2823,8 +2823,9 @@ long do_mount(const char *dev_name, cons
 		goto dput_out;
 
 	/* Default to relatime unless overriden */
-	if (!(flags & MS_NOATIME))
-		mnt_flags |= MNT_RELATIME;
+	/* JED: Make that noatime */
+	if (!(flags & MS_RELATIME))
+		mnt_flags |= MNT_NOATIME;
 
 	/* Separate the per-mountpoint flags */
 	if (flags & MS_NOSUID)
@@ -2837,6 +2837,8 @@ long do_mount(const char *dev_name, cons
 		mnt_flags |= MNT_NOATIME;
 	if (flags & MS_NODIRATIME)
 		mnt_flags |= MNT_NODIRATIME;
+	if (flags & MS_RELATIME)
+		mnt_flags |= MNT_RELATIME;
 	if (flags & MS_STRICTATIME)
 		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
 	if (flags & MS_RDONLY)

Sane, or am I "doing it wrong!"(TM), or perhaps doing it correctly, but missing a chunk that should be applied elsewhere? Meanwhile, since broken rootflags requiring an initr* came up let me take the opportunity to ask once again, does btrfs-raid1 root still require an initr*? It'd be /so/ nice to be able to supply the appropriate rootflags=device=...,device=... and actually have it work so I didn't need the initr* any longer! -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
#x27;s" quotas shouldn't be affected, because that's not what btrfs quotas manage. There are other (non-btrfs) tools for that. >>> In short: values representing quotas are user-oriented ("the numbers >>> one bought"), not storage-oriented ("the numbers they actually >>> occupy"). Btrfs quotas are storage-oriented, and if you're using them, at least directly, for user-oriented, you're using the proverbial screwdriver as a proverbial hammer. > What is VFS disk quotas and does Btrfs use that at all? If not, why not? > It seems to me there really should be a high level basic per directory > quota implementation at the VFS layer, with a single kernel interface as > well as a single user space interface, regardless of the file system. > Additional file system specific quota features can of course have their > own tools, but all of this re-invention of the wheel for basic directory > quotas is a mystery to me. As mentioned above and by others, btrfs quotas don't use vfs quotas (or the reverse, really, it'd be vfs quotas using information exposed by btrfs quotas... if it worked that way), because there's an API mis-match because their intended usage and the information they convey and control is different, and (AFAIK) was never intended or claimed to be the same. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: recover broken partition on external HDD
…time/hassle/resources, before you ever lost the data, and the data loss isn't a big deal because the data, by definition of not having a backup, can be of only trivial value, not worth the hassle.

There's no #3. The data was either defined as worth a backup by virtue of having one, and can be restored from there, or it wasn't, but no big deal, because the time/trouble/resources that would otherwise have gone into that backup were defined as more important, and were saved before the data was ever lost in the first place.

Thus, while the loss of the data due to fat-fingering (a risk all sysadmins come to appreciate as very real, after a few events of their own) the placement of that ZFS might be a bit of a bother, it's not worth spending huge amounts of time trying to recover, because either the data was worth having a backup, in which case you simply recover from it, or it wasn't, in which case it's not worth spending huge amounts of time trying to recover, either.

Of course there's still the difference between the pre-disaster weighed risk that something will go wrong and the post-disaster "it DID go wrong, now how do I best get back to normal operation" question, but in the context of the backups rule above, resolving that question is more a matter of whether it's most efficient to spend a little time trying to recover the existing data with no guarantee of full success, or to jump directly into the wipe and restore from known-good (because tested!) backups, which might take more time, but has a (near) 100% chance of recovery to the point of the backup. (The slight chance of failure to recover from tested backups is what multiple levels of backups cover for, with the value of the data and the weighed risk balanced against the value of the time/hassle/resources necessary to do that one more level of backup.)
So while it might be worth a bit of time to quick-test recovery of the damaged data, it very quickly becomes not worth the further hassle, because either the data was already defined as not worth it by not having a backup, or restoring from that backup will be faster and less hassle, with a far greater chance of success, than diving further into the data-recovery morass, with ever more limited chances of success.

Live by that sort of policy from now on, and the results of the next failure, whether it be hardware, software, or wetware (another fat-fingering; again, this is coming from someone, me, who has had enough of their own!), won't be anything to write the list about -- unless of course it's a btrfs bug, and quite apart from worrying about your data, you're just trying to get it fixed so it won't continue to happen.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS and databases
MegaBrutal posted on Wed, 01 Aug 2018 05:45:15 +0200 as excerpted:

> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?
>
> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW nature
> that is elsewhere a blessing, with databases it's a drawback). But are
> there any advantages of still sticking to BTRFS for a database albeit
> CoW is disabled, or should I just return to the old and reliable ext4
> for those applications?

Good question, on which I might expect some honest disagreement about the answer.

Personally, I tend to hate nocow with a passion, and would thus recommend putting databases and files with similar write patterns (VM images...) on their own dedicated non-btrfs filesystem (ext4, etc.) if at all reasonable. But that comes from a general partition-favoring viewpoint, where creating another partition/lvm-volume and putting a different filesystem on it is no big deal, as it's just one more partition/volume to manage out of (likely) several.

Some distros/companies/installations have policies strongly favoring btrfs for its "storage pool" features, trying to keep things simple and flexible by using just the one solution, one big btrfs, and throwing everything onto it, often using btrfs subvolumes where others would use separate partitions/volumes with independent filesystems. For these folks, the flexibility of being able to throw it all on one filesystem with subvolumes overrides the downsides of having to deal with nocow and its conditions, rules, and additional risk. And a big part of that flexibility, along with being a feature in its own right, is btrfs' built-in multi-device support, without having to resort to an additional multi-device layer such as lvm or mdraid.
So if you're using btrfs for multi-device or other features that nocow doesn't affect, it's plausible that you'd prefer nocow on btrfs to /having/ to do partitioning/lvm/mdraid and set up that separate non-btrfs just for your database (or VM image) files.

But from your post you're perfectly fine with partitioning and the like already, and won't consider it a heavy imposition to deal with a separate non-btrfs, ext4 or whatever. In that case, at least here, I'd strongly recommend you do just that, avoiding the nocow that I honestly see as a compromise best left to those who really need it because they aren't prepared to deal with the hassle of setting up the separate filesystem, along with all that entails.
Re: csum failed on raid1 even after clean scrub?
Sterling Windmill posted on Mon, 30 Jul 2018 21:06:54 -0400 as excerpted:

> Both drives are identical, Seagate 8TB external drives

Are those the "shingled" SMR drives, normally sold as archive drives, first commonly available in the 8TB size, and often bought for their generally better price-per-TB without fully realizing the implications?

There have been bugs regarding those drives in the past, and while I believe those bugs were fixed and AFAIK the current status is no known SMR-specific bugs, they really are /not/ particularly suited to btrfs usage even for archiving, and definitely not to general usage (that is, pretty much anything but the straight-up archiving use-case they are sold for).

Of course USB connections are notorious for being unreliable in terms of btrfs usage as well, and I'd really hate to think what a combination of SMR on USB might wreak.

If they're not SMR then carry on! =:^)
Re: File permissions lost during send/receive?
Marc Joliet posted on Tue, 24 Jul 2018 22:42:06 +0200 as excerpted:

> On my system I get:
>
> % sudo getcap /bin/ping /sbin/unix_chkpwd
> /bin/ping = cap_net_raw+ep
> /sbin/unix_chkpwd = cap_dac_override+ep
>
>> (getcap on unix_chkpwd returns nothing, but while I use kde/plasma I
>> don't normally use the lockscreen at all, so for all I know that's
>> broken here too.)

OK, after remerging pam, I get the same for unix_chkpwd (tho here I have the sbin merge, so it's /bin/unix_chkpwd with sbin -> bin), so indeed, it must have been the same problem for you with it -- one that I simply haven't run into since whatever killed the filecaps here, because I don't use the lockscreen.

But if I start using the lockscreen again and it fails, I know one not-so-intuitive thing to check now. =:^)
Re: File permissions lost during send/receive?
Andrei Borzenkov posted on Tue, 24 Jul 2018 20:53:15 +0300 as excerpted:

> 24.07.2018 15:16, Marc Joliet writes:
>> Hi list,
>>
>> (Preemptive note: this was with btrfs-progs 4.15.1, I have since
>> upgraded to 4.17. My kernel version is 4.14.52-gentoo.)
>>
>> I recently had to restore the root FS of my desktop from backup
>> (extent tree corruption; not sure how, possibly a loose SATA cable?).
>> Everything was fine, even if restoring was slower than expected.
>> However, I encountered two files with permission problems, namely:
>>
>> - /bin/ping, which caused running ping as a normal user to fail due to
>> missing permissions, and
>>
>> - /sbin/unix_chkpwd (part of PAM), which prevented me from unlocking
>> the KDE Plasma lock screen; I needed to log into a TTY and run
>> "loginctl unlock-session".
>>
>> Both were easily fixed by reinstalling the affected packages (iputils
>> and pam), but I wonder why this happened after restoring from backup.
>>
>> I originally thought it was related to the SUID bit not being set,
>> because of the explanation in the ping(8) man page (section
>> "SECURITY"), but cannot find evidence of that -- that is, after
>> reinstallation, "ls -lh" does not show the sticky bit being set, or
>> any other special permission bits, for that matter:
>>
>> % ls -lh /bin/ping /sbin/unix_chkpwd
>> -rwx--x--x 1 root root 60K 22. Jul 14:47 /bin/ping*
>> -rwx--x--x 1 root root 31K 23. Jul 00:21 /sbin/unix_chkpwd*
>>
>> (Note: no ACLs are set, either.)
>
> What does "getcap /bin/ping" say? You may need to install the package
> providing getcap (libcap-progs here on openSUSE).

sys-libs/libcap on gentoo. Here's what I get:

$ getcap /bin/ping
/bin/ping = cap_net_raw+ep

(getcap on unix_chkpwd returns nothing, but while I use kde/plasma I don't normally use the lockscreen at all, so for all I know that's broken here too.)

As hinted, it's almost certainly a problem with filecaps.
While I'll freely admit to not fully understanding how filecaps work, and my use-case doesn't use send/receive, I do recall that filecaps are what ping uses these days instead of SUID/SGID (on gentoo it'd be iputils' filecaps and possibly caps USE flags controlling this for ping), and also that btrfs send/receive had a recent bugfix related to the extended attributes normally used to record filecaps, so the symptoms match the bug and that's probably what you were seeing.
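[For the curious: file capabilities live in the `security.capability` extended attribute -- getcap(8) just decodes it -- which is why an xattr-related send/receive bug can silently strip them. A minimal sketch of checking for the raw xattr from Python (Linux-only; the function name is mine, and whether a given binary carries the attribute depends on the system):]

```python
import os
import tempfile

# File capabilities are stored in the "security.capability" xattr.
# If send/receive drops xattrs, this attribute vanishes and the
# binary silently loses its capabilities.
def raw_filecaps(path):
    try:
        return os.getxattr(path, "security.capability")  # raw cap blob
    except OSError:  # ENODATA: no caps set (or xattrs unsupported)
        return None

# A freshly created file has no capability xattr:
with tempfile.NamedTemporaryFile() as f:
    print(raw_filecaps(f.name))  # None
```

On an affected system, comparing `raw_filecaps("/bin/ping")` before and after a restore would show exactly the kind of loss described above (None after, a blob before).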
Re: btrfs filesystem corruptions with 4.18. git kernels
Alexander Wetzel posted on Fri, 20 Jul 2018 23:28:42 +0200 as excerpted:

> A btrfs subvolume is used as the rootfs on a "Samsung SSD 850 EVO mSATA
> 1TB" and I'm running Gentoo ~amd64 on a Thinkpad W530. Discard is
> enabled as mount option and there were roughly 5 other subvolumes.

Regardless of what your trigger problem is, running with the discard mount option considerably increases your risk in at least two ways, with a possible performance cost as a third consideration:

1) Btrfs normally has a feature that tracks old root blocks, which are COWed out at each commit. Should something be wrong with the current one, btrfs can fall back to an older one using the usebackuproot mount option (formerly recovery, but that clashed with the (no)recovery standard option as used on other OSs, so it was renamed usebackuproot). This won't always work, but when it does it's one of the first-line recovery/repair options, as it tends to mean losing only 30-90 seconds (first thru third old roots) worth of writes, while being quite likely to get you the working filesystem as it was at that commit.

But with discard, once a root goes unused it gets marked for discard, and depending on the hardware/firmware implementation, it may be discarded immediately. If it is, that means no backup roots are available for recovery should the current root be bad for whatever reason, which pretty well takes out your first and best three chances of a quick fix without much risk.

2) In the past there have been bugs that triggered on discard. AFAIK there are no such known bugs at this time, but in addition to the risk of point one, there is the additional risk of bugs that trigger on discard itself, and due to the nature of the discard feature, these sorts of bugs have a much higher chance than normal of being data-eating bugs.

3) Depending on the device, the discard mount option may or may not have negative performance implications as well.
So while the discard mount option is there, it's definitely not recommended, unless you really are willing to deal with that extra risk and the loss of the backuproot safety-nets, and have additionally researched its effects on your hardware to make sure it's not actually slowing you down (which, granted, on good mSATA it may not be, as those are new enough to have a higher likelihood of actually having working queued-trim support).

The alternative to the discard mount option is a scheduled timer/cron job (like the one systemd ships; just activate it) that does a periodic fstrim (weekly, for systemd's timer). That lowers the risk window to the few commits immediately after the fstrim job runs -- as long as you don't crash during that time, you'll have backup roots available, as the current root will have moved on since then, creating backups again as it did so.

Or just leave a bit of extra room on the ssd untouched (ideally trimmed before partitioning and then left unpartitioned, so the firmware knows it's clean and can use it at its convenience), so the ssd can use that extra room for its wear-leveling, and don't do trim/discard at all.

FWIW I actually do both of these here, leaving significant space on the device unpartitioned, and enabling that systemd fstrim timer job as well.
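[The backup-root mechanism point 1) describes can be sketched as a toy model: each commit supersedes the previous root, and the last few superseded roots normally stay on media as usebackuproot candidates. The sketch assumes the worst case for discard, where the device zeroes a freed root immediately; the three-backup count is illustrative:]

```python
from collections import deque

# Toy model: each commit COWs a new root, superseding the previous one.
# Without discard, the last few superseded roots remain on media as
# fallbacks for usebackuproot.  With the discard mount option (worst
# case: device zeroes immediately), a superseded root is gone at once.
NUM_BACKUP_ROOTS = 3  # illustrative count of fallback roots

def backup_roots_after(n_commits, discard):
    backups = deque(maxlen=NUM_BACKUP_ROOTS)
    for generation in range(1, n_commits):
        superseded = generation - 1   # the root this commit replaces
        if discard:
            continue                  # freed and (worst case) zeroed now
        backups.append(superseded)    # still on media, usable as fallback
    return list(backups)

print(backup_roots_after(10, discard=False))  # [6, 7, 8]: roots to fall back to
print(backup_roots_after(10, discard=True))   # []: nothing to fall back to
```

A weekly fstrim behaves like the discard=False case nearly all the time, with the empty-list window limited to the first few commits right after the trim runs -- which is the risk-window argument made above.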
Re: [PATCH 0/4] 3- and 4- copy RAID1
Duncan posted on Wed, 18 Jul 2018 07:20:09 +0000 as excerpted:

>> As implemented in BTRFS, raid1 doesn't have striping.
>
> The argument is that because there's only two copies, on multi-device
> btrfs raid1 with 4+ devices of equal size, so chunk allocations tend to
> alternate device pairs, it's effectively striped at the macro level,
> with the 1 GiB device-level chunks effectively being huge individual
> device strips of 1 GiB.
>
> At 1 GiB strip size it doesn't have the typical performance advantage
> of striping, but conceptually, it's equivalent to raid10 with huge
> 1 GiB strips/chunks.

I forgot this bit... Similarly, multi-device single is regarded by some as conceptually equivalent to raid0 with really huge GiB strips/chunks.

(As you may note, "the argument is" and "regarded by some" are distancing phrases. I've seen the argument made on-list, and while I understand it and agree with it to some extent, I'm still a bit uncomfortable with it and don't normally make it myself -- this thread being a noted exception, tho originally I simply repeated what someone else had already said in-thread -- because I too agree it's stretching things a bit. But it does appear to be a useful conceptual equivalency for some, and I do see the similarity. Perhaps it's a case of the coder's view (no code doing it that way; it's just a coincidental oddity conditional on equal sizes) vs. the sysadmin's view (code or not, accidental or not, it's a reasonably accurate high-level description of how it ends up working most of the time with equal-sized devices).)
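[The "macro striping" equivalency is easy to see in a toy allocator. This sketch uses the simplified rule that each raid1 chunk's two copies go on the two devices with the most free space (ties broken by device index here; the real kernel allocator is more involved), which is enough to show the pair alternation on four equal devices:]

```python
# Toy model of btrfs raid1 chunk allocation: each ~1 GiB chunk gets two
# copies, placed on the two devices with the most free space.  Tie-break
# by device index is a simplification, not the actual kernel behavior.
def allocate_chunks(num_devices, device_size_gib, num_chunks):
    free = [device_size_gib] * num_devices
    placements = []
    for _ in range(num_chunks):
        # pick the two devices with the most free space
        pair = sorted(range(num_devices), key=lambda d: (-free[d], d))[:2]
        for d in pair:
            free[d] -= 1  # one 1 GiB chunk copy per device
        placements.append(tuple(sorted(pair)))
    return placements

# Four equal devices: allocations alternate between the (0,1) and (2,3)
# pairs, i.e. striping at chunk (1 GiB "strip") granularity.
print(allocate_chunks(4, 4, 6))
# [(0, 1), (2, 3), (0, 1), (2, 3), (0, 1), (2, 3)]
```

With unequal device sizes the alternation breaks down as the small devices fill, which is one reason the equivalency only holds "with equivalent sized devices", as noted above.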
Re: [PATCH 0/4] 3- and 4- copy RAID1
Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as excerpted:

> On 07/17/2018 11:12 PM, Duncan wrote:
>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>> excerpted:
>>
>>> On 07/15/2018 04:37 PM, waxhead wrote:
>>
>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>> parity are mutually exclusive.
>>
>> I can't agree. I don't know whether you meant that in the global
>> sense, or purely in the btrfs context (which I suspect), but either
>> way I can't agree.
>>
>> In the pure btrfs context, while striping and mirroring/pairing are
>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>> flexible enough to allow both together and the feature may at some
>> point be added, so it makes sense to have a layout notation format
>> flexible enough to allow it as well.
>
> When I say orthogonal, it means that these can be combined; i.e. you
> can have:
> - striping (RAID0)
> - parity (?)
> - striping + parity (e.g. RAID5/6)
> - mirroring (RAID1)
> - mirroring + striping (RAID10)
>
> However you can't have mirroring+parity; this means that a notation
> where both 'C' (= number of copies) and 'P' (= number of parities)
> appear is too verbose.

Yes, you can have mirroring+parity: conceptually it's simply raid5/6 on top of mirroring, or mirroring on top of raid5/6, much as raid10 is conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 on top of raid0. While it's not possible today on (pure) btrfs (it's possible today with md/dm-raid or hardware raid handling one layer), it's theoretically possible both for btrfs and in general, and it could be added to btrfs in the future, so a notation with the flexibility to allow parity and mirroring together does make sense, and having just that sort of flexibility is exactly why Hugo made the notation proposal he did.

Tho a sensible use-case for mirroring+parity is a different question.
I can see a case being made for it if one layer is hardware/firmware raid, but I'm not entirely sure what the use-case for pure-btrfs raid16 or 61 (or 15 or 51) might be, where pure mirroring or pure parity wouldn't arguably be at least as good a match to the use-case. Perhaps one of the other experts in such things here might help with that.

>>> Question #2: historically RAID10 requires 4 disks. However I am
>>> guessing if the stripe could be done on a different number of disks:
>>> What about RAID1+Striping on 3 (or 5) disks? The key of striping is
>>> that every 64k, the data are stored on a different disk
>>
>> As someone else pointed out, md/lvm-raid10 already work like this.
>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>> much works this way except with huge (gig size) chunks.
>
> As implemented in BTRFS, raid1 doesn't have striping.

The argument is that because there's only two copies, on multi-device btrfs raid1 with 4+ devices of equal size, so chunk allocations tend to alternate device pairs, it's effectively striped at the macro level, with the 1 GiB device-level chunks effectively being huge individual device strips of 1 GiB.

At 1 GiB strip size it doesn't have the typical performance advantage of striping, but conceptually, it's equivalent to raid10 with huge 1 GiB strips/chunks.
Re: [PATCH 0/4] 3- and 4- copy RAID1
Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as excerpted:

> On 07/15/2018 04:37 PM, waxhead wrote:
>
> Striping and mirroring/pairing are orthogonal properties; mirror and
> parity are mutually exclusive.

I can't agree. I don't know whether you meant that in the global sense, or purely in the btrfs context (which I suspect), but either way I can't agree.

In the pure btrfs context, while striping and mirroring/pairing are orthogonal today, Hugo's whole point was that btrfs is theoretically flexible enough to allow both together, and the feature may at some point be added, so it makes sense to have a layout notation format flexible enough to allow it as well.

In the global context, just to complete things, and mostly for others reading, as I feel a bit like a simpleton explaining to the expert here: just as raid10 is shorthand for raid1+0, aka raid0 layered on top of raid1 (normally preferred to raid01 due to rebuild characteristics, and as opposed to raid01, aka raid0+1, aka raid1 on top of raid0, sometimes recommended as btrfs raid1 on top of whatever raid0 here due to btrfs' data-integrity characteristics and less optimized performance), so there's also raid51 and raid15, raid61 and raid16, etc., with or without the + symbols, involving mirroring and parity conceptually at two different levels, altho they can be combined in a single implementation just as raid10 and raid01 commonly are. These additional layered-raid levels can be used for higher reliability, with differing rebuild and performance characteristics between the two forms depending on which is the top layer.

> Question #1: for "parity" profiles, does it make sense to limit the
> maximum disks number where the data may be spread? If the answer is no,
> we could omit the last S.

IMHO it should.
As someone else already replied, btrfs doesn't currently have the ability to specify a spread limit, but the idea, if we're going to change the notation, is to allow for that flexibility in the new notation so the feature can be added later without further notation changes.

Why might it make sense to specify spread? At least two possible reasons:

a) (Stealing an already-posted example.) Consider a multi-device layout with two or more device sizes. Someone may want to limit the spread in order to keep performance and risk consistent as the smaller devices fill up, limiting further usage to a lower number of devices. If that lower number is specified as the spread originally, it'll make things more consistent between the room-on-all-devices case and the room-on-only-some-devices case.

b) Limiting spread can change the risk and rebuild-performance profiles. Stripes of full width mean every stripe has a strip on each device, so knock a device out and (assuming parity or mirroring) replace it, and all stripes are degraded and must be rebuilt. With less than maximum spread, some stripes won't be striped to the replaced device, and won't be degraded or need rebuilding, tho assuming the same overall fill, a larger percentage of the stripes that /do/ need rebuilding will be on the replaced device. So the risk profile is either more "objects" (stripes/chunks/files) affected but less of each object, or less of the total affected but more of each affected object.

> Question #2: historically RAID10 requires 4 disks. However I am
> guessing if the stripe could be done on a different number of disks:
> What about RAID1+Striping on 3 (or 5) disks? The key of striping is
> that every 64k, the data are stored on a different disk

As someone else pointed out, md/lvm-raid10 already work like this. What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much works this way, except with huge (gig-size) chunks.
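[Point b) can be made quantitative with a simple expected-impact model: if each stripe lands on W of N devices chosen uniformly at random, the fraction of stripes degraded by one failed device approaches W/N. A quick Monte Carlo sketch (the uniform-placement assumption is mine, purely for illustration):]

```python
import random

# Point b) in numbers: with stripes spread over W of N devices chosen
# at random, the fraction of stripes degraded by one failed device
# approaches W/N -- narrower spread means fewer stripes to rebuild.
def degraded_fraction(n_devices, width, n_stripes=100_000, seed=42):
    rng = random.Random(seed)
    failed = 0  # the device we pretend died
    hit = sum(
        failed in rng.sample(range(n_devices), width)
        for _ in range(n_stripes)
    )
    return hit / n_stripes

print(degraded_fraction(10, 10))         # full width: 1.0, every stripe degraded
print(round(degraded_fraction(10, 4)))   # width 4 of 10: roughly 0.4 of stripes
```

The flip side, as noted above, is that the ~W/N of stripes that are affected concentrate a larger share of the replaced device's rebuild traffic, so spread limits trade breadth of impact for depth.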
Re: how to best segment a big block device in resizeable btrfs filesystems?
Andrei Borzenkov posted on Fri, 06 Jul 2018 07:28:48 +0300 as excerpted:

> 03.07.2018 10:15, Duncan writes:
>> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as
>> excerpted:
>>
>>> 02.07.2018 21:35, Austin S. Hemmelgarn writes:
>>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>>> bit dangerous to do it while writes are happening).
>>>
>>> Could you please elaborate? Do you mean btrfs can trim data before
>>> new writes are actually committed to disk?
>>
>> No.
>>
>> But normally old roots aren't rewritten for some time simply due to
>> odds (fuller filesystems will of course recycle them sooner), and the
>> btrfs mount option usebackuproot (formerly recovery, until the
>> norecovery mount option that parallels that of other filesystems was
>> added and this option was renamed to avoid confusion) can be used to
>> try an older root if the current root is too damaged to successfully
>> mount.
>>
>> But other than simply by odds not using them again immediately, btrfs
>> has no special protection for those old roots, and trim/discard will
>> recover them to hardware-unused as it does any other unused space, tho
>> whether it simply marks them for later processing or actually
>> processes them immediately is up to the individual implementation --
>> some do it immediately, killing all chances of using the backup root
>> because it's already zeroed out, some don't.
>
> How is it relevant to "while writes are happening"? Will trimming old
> trees immediately after writes have stopped be any different? Why?

Define "while writes are happening" vs. "immediately after writes have stopped". How soon is "immediately", and does the writes-stopped condition account for data that has reached the device hardware's write buffer (so is no longer being transmitted to the device across the bus) but hasn't actually been written to media, or not?
On a reasonably quiescent system, multiple empty write cycles are likely to have occurred since the last write barrier, and anything in-process is likely to have made it to media, even if software is missing a write barrier it needs (a software bug) or the hardware lies about honoring the write barrier (a hardware bug, allegedly sometimes deliberate on hardware willing to gamble with your data that a crash won't happen at a critical moment -- a somewhat rare occurrence -- in order to improve normal-operation performance metrics).

On an IO-maxed system, data and write barriers are coming down as fast as the system can handle them, and write barriers become critical. Crash after something was supposed to get to media but didn't -- either because of a missing write barrier, or because the hardware/firmware lied about the barrier and claimed the data it was supposed to ensure was on-media when it wasn't -- and the btrfs atomic-cow commit guarantee of a consistent state at each commit goes out the window. At this point it becomes useful to have a number of previous "guaranteed consistent state" roots to fall back on, with the /hope/ being that at least /one/ of them is usably consistent. If all but the last one are wiped due to trim...

When the system isn't write-maxed, the write will almost certainly have made it regardless of whether the barrier is there or not, because there's enough idle time to finish the current write before another one comes down the pipe, so the last-written root is almost certain to be fine regardless of barriers, and the history of past roots doesn't matter even if there's a crash.
If "immediately after writes have stopped" is strictly defined as the condition in which all writes, including the btrfs commit updating the current root and the superblock pointers to the current root, have completed, with no new writes coming down the pipe in the meantime that might have been delayed by a missed barrier, then trimming old roots in this state should be entirely safe, and the distinction between that state and "while writes are happening" is clear.

But if "immediately after writes have stopped" is less strictly defined, then the distinction between that state and "while writes are happening" remains blurry at best, and having old roots around to fall back on in case a write barrier was missed (for whatever reason, hardware or software) becomes a very good thing.

Of course the fact that trim/discard itself is an instruction written to the device in the combined command/data stream complicates the picture substantially. If those write barriers get missed, who knows what state the new root is in -- and if the old ones got erased...
Re: unsolvable technical issues?
Austin S. Hemmelgarn posted on Mon, 02 Jul 2018 07:49:05 -0400 as excerpted:

> Notably, most Intel systems I've seen have the SATA controllers in the
> chipset enumerate after the USB controllers, and the whole chipset
> enumerates after add-in cards (so they almost always have this issue),
> while most AMD systems I've seen demonstrate the exact opposite
> behavior: they enumerate the SATA controller from the chipset before
> the USB controllers, and then enumerate the chipset before all the
> add-in cards (so they almost never have this issue).

Thanks. That's a difference I wasn't aware of, and it would (because I tend to favor amd) explain why I've never seen a change in enumeration order unless I've done something like unplug my sata cables for maintenance and forgotten which ones I had plugged in where -- random USB stuff left plugged in doesn't seem to matter, and even choosing different boot media from the bios doesn't seem to matter by the time the kernel runs (I'm less sure about grub).
Re: how to best segment a big block device in resizeable btrfs filesystems?
Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:

> 02.07.2018 21:35, Austin S. Hemmelgarn writes:
>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>> bit dangerous to do it while writes are happening).
>
> Could you please elaborate? Do you mean btrfs can trim data before new
> writes are actually committed to disk?

No.

But normally old roots aren't rewritten for some time simply due to odds (fuller filesystems will of course recycle them sooner), and the btrfs mount option usebackuproot (formerly recovery, until the norecovery mount option that parallels that of other filesystems was added and this option was renamed to avoid confusion) can be used to try an older root if the current root is too damaged to successfully mount.

But other than simply not reusing them immediately by odds, btrfs has no special protection for those old roots, and trim/discard will recover them to hardware-unused as it does any other unused space, tho whether it simply marks them for later processing or actually processes them immediately is up to the individual implementation -- some do it immediately, killing all chances of using the backup root because it's already zeroed out, some don't.

In the context of the discard mount option, that can mean there are never any old roots available, ever, as they've already been cleaned up by the hardware due to the discard option telling the hardware to do it. But even without that mount option, simply doing the trims periodically, as done weekly by, for instance, the systemd fstrim timer and service units, or done manually if you prefer, obviously potentially wipes the old roots at that point.
If the system's effectively idle at the time, not much risk as the current commit is likely to represent a filesystem in full stasis, but if there's lots of writes going on at that moment *AND* the system happens to crash at just the wrong time, before additional commits have recreated at least a bit of root history, again, you'll potentially be left without any old roots for the usebackuproot mount option to try to fall back to, should it actually be necessary.
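The trade-off described above can be sketched as commands. This is a hedged illustration, not a recommendation: device names and mount points are placeholders, and the commands need root on a real btrfs volume.

```shell
# Hypothetical device and mountpoint; adjust to your own system.

# Option A: the discard mount option -- old roots may be erased by the
# hardware almost as soon as they're freed:
#   mount -o discard /dev/sdX1 /mnt

# Option B (periodic trim, as discussed above): mount normally and trim
# on a schedule, e.g. the weekly systemd timer, or manually:
systemctl enable fstrim.timer
fstrim -v /mnt

# If the current root is too damaged to mount, try falling back to an
# older root -- this only works if trim/discard hasn't already zeroed
# the old roots:
mount -o ro,usebackuproot /dev/sdX1 /mnt
```

The point of the thread: option B still wipes old roots, just less often, so a crash in the window right after a trim leaves usebackuproot with nothing to fall back to.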
Re: btrfs send/receive vs rsync
Marc MERLIN posted on Fri, 29 Jun 2018 09:24:20 -0700 as excerpted: >> If instead of using a single BTRFS filesystem you used LVM volumes >> (maybe with Thin provisioning and monitoring of the volume group free >> space) for each of your servers to backup with one BTRFS filesystem per >> volume you would have less snapshots per filesystem and isolate >> problems in case of corruption. If you eventually decide to start from >> scratch again this might help a lot in your case. > > So, I already have problems due to too many block layers: > - raid 5 + ssd - bcache - dmcrypt - btrfs > > I get occasional deadlocks due to upper layers sending more data to the > lower layer (bcache) than it can process. I'm a bit wary of adding yet > another layer (LVM), but you're otherwise correct that keeping smaller > btrfs filesystems would help with performance and containing possible > damage. > > Has anyone actually done this? :) So I definitely use (and advocate!) the split-em-up strategy, and I use btrfs, but that's pretty much all the similarity we have. I'm all ssd, having left spinning rust behind. My strategy avoids unnecessary layers like lvm (tho crypt can arguably be necessary), preferring direct on-device (gpt) partitioning for simplicity of management and disaster recovery. And my backup and recovery strategy is an equally simple mkfs and full-filesystem-fileset copy to an identically sized filesystem, with backups easily bootable/mountable in place of the working copy if necessary, and multiple backups so if disaster takes out the backup I was writing at the same time as the working copy, I still have a backup to fall back to. So it's different enough I'm not sure how much my experience will help you.
But I /can/ say the subdivision is nice, as it means I can keep my root filesystem read-only by default for reliability, my most-at-risk log filesystem tiny for near-instant scrub/balance/check, and my also at risk home small as well, with the big media files being on a different filesystem that's mostly read-only, so less at risk and needing less frequent backups. The tiny boot and large updates (distro repo, sources, ccache) are also separate, and mounted only for boot maintenance or updates.
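The mkfs-and-copy backup scheme described above can be sketched roughly as follows. Partition labels and the choice of rsync are hypothetical stand-ins, and the backup partition is assumed to be the same size as the working root:

```shell
# Hypothetical labels/devices; adjust to your own layout.
mkfs.btrfs -f -L rootbak /dev/sdY3     # backup partition, sized like the working root
mount /dev/sdY3 /mnt/rootbak

# Full-filesystem fileset copy of the working root (one-filesystem-only,
# preserving hardlinks, ACLs and xattrs):
rsync -aHAXx --delete / /mnt/rootbak/
umount /mnt/rootbak
```

Because the backup is a complete, independent filesystem, it can be booted or mounted in place of the working copy simply by pointing root= (or the relevant fstab entry) at LABEL=rootbak, which is what makes recovery so simple in this scheme.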
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
cause while a "regular user" may not know it because it's not his /job/ to know it, if there's anything an admin knows *well* it's that the working copy of data **WILL** be damaged. It's not a matter of if, but of when, and of whether it'll be a fat-finger mistake, or a hardware or software failure, or wetware (theft, ransomware, etc), or wetware (flood, fire and the water that put it out damage, etc), tho none of that actually matters after all, because in the end, the only thing that matters was how the value of that data was defined by the number of backups made of it, and how quickly and conveniently at least one of those backups can be retrieved and restored. Meanwhile, an admin worth the label will also know the relative risk associated with various options they might use, including nocow, and knowing that downgrades the stability rating of the storage approximately to the same degree that raid0 does, they'll already be aware that in such a case the working copy can only be defined as "throw-away" level in case of problems in the first place, and will thus not even consider their working copy to be a permanent copy at all, just a temporary garbage copy, only slightly more reliable than one stored on tmpfs, and will thus consider the first backup thereof the true working copy, with an additional level of backup beyond what they'd normally have thrown in to account for that fact. So in case of problems people can simply restore nocow files from a near-line stable working copy, much as they'd do after reboot or a umount/remount cycle for a file stored in tmpfs. And if they didn't have even a stable working copy let alone a backup... well, much like that file in tmpfs, what did they expect? They *really* defined that data as of no more than trivial value, didn't they?
All that said, making the NOCOW warning labels a bit bolder couldn't hurt; and making scrub in the nocow case at least compare copies and report differences, simply makes it easier for people to know they need to reach for that near-line stable working copy, or mkfs and start from scratch if they defined the data value as not worth the trouble of (in this case) even a stable working copy, let alone a backup, so that'd be a good thing too. =:^)
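For readers unfamiliar with how nocow gets set in the first place, here is a minimal sketch. The directory path is hypothetical; the key real-world constraint is that the `C` attribute only takes effect on files that are still empty, so the usual pattern is to set it on the containing directory so new files inherit it:

```shell
# NOCOW must be applied before a file has data; set it on the directory
# so files created inside inherit the attribute:
mkdir -p /var/lib/vm-images
chattr +C /var/lib/vm-images
lsattr -d /var/lib/vm-images   # the 'C' flag should now be listed

# Files created here are nodatacow: no COW, and crucially no checksums,
# which is why scrub cannot detect (let alone repair) divergence between
# raid copies of such files -- the scenario discussed above.
```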
Re: unsolvable technical issues?
Hugo Mills posted on Mon, 25 Jun 2018 16:54:36 +0000 as excerpted: > On Mon, Jun 25, 2018 at 06:43:38PM +0200, waxhead wrote: > [snip] >> I hope I am not asking for too much (but I know I probably am), but I >> suggest that having a small snippet of information on the status page >> showing a little bit about what is either currently the development >> focus, or what people are known for working at would be very valuable >> for users and it may of course work both ways, such as exciting people >> or calming them down. ;) >> >> For example something simple like a "development focus" list... >> 2018-Q4: (planned) Renaming the grotesque "RAID" terminology >> 2018-Q3: (planned) Magical feature X >> 2018-Q2: N-Way mirroring >> 2018-Q1: Feature work "RAID"5/6 >> >> I think it would be good for people living their lives outside as it >> would perhaps spark some attention from developers and perhaps even >> media as well. > I started doing this a couple of years ago, but it turned out to be > impossible to keep even vaguely accurate or up to date, without going > round and bugging the developers individually on a per-release basis. I > don't think it's going to happen. In addition, anything like quarter, kernel cycle, etc, has been repeatedly demonstrated to be entirely broken beyond "current", because roadmapped tasks have rather consistently taken longer, sometimes /many/ /times/ longer (by a factor of 20+ in the case of raid56), than first predicted. But in theory it might be doable, with just a roughly ordered list, no dates beyond "current focus", and with suitably big disclaimers about other things (generally bugs in otherwise more stable features, but occasionally a quick sub-feature that is seen to be easier to introduce at the current state than it might be later, etc) possibly getting priority and temporarily displacing roadmapped items.
In fact, this last one is the big reason why raid56 has taken so long to even somewhat stabilize -- the devs kept finding bugs in already semi-stable features that took priority... for kernel cycle after kernel cycle. The quotas/qgroups feature, already introduced and intended to be at least semi-stable, was one such culprit, requiring repeated rewrites and kernel cycles worth of bug squashing. A few critical-under-the-right-circumstances compression bugs, where compression was supposed to be an already reasonably stable feature, were another, tho these took far less developer bandwidth than quotas. Getting a reasonably usable fsck was a bunch of little patches. AFAIK that one wasn't actually an original focus and was intended to be back-burnered for some time, but once btrfs hit mainline, users started demanding it, so the priority was bumped. And of course having it has been good for finding and ultimately fixing other bugs as well, so it wasn't a bad thing, but the hard fact is the repairing fsck has taken, all told, I'd guess about the same number of developer cycles as quotas, and those developer cycles had to come from stuff that had been roadmapped for earlier. As a bit of an optimist I'd be inclined to argue that OK, we've gotten btrfs in far better shape general stability-wise now, and going forward, the focus can be back on the stuff that was roadmapped for earlier that this stuff displaced, so one might hope things will move faster again now, but really, who knows? That's arguably what the devs thought when they mainlined btrfs, too, and yet it took all this much longer to mature and stabilize since then. Still, it /has/ to happen at /some/ point, right? And I know for a fact that btrfs is far more stable now than it was... because things like ungraceful shutdowns that used to at minimum trigger (raid1 mode) scrub fixes on remount and scrub, now...
don't -- btrfs is now stable enough that the atomic COW is doing its job and things "just work", where before, they required scrub repair at best, and occasional blow away and restore from backups. So I can at least /hope/ that the worst of the plague of bugs is behind us, and people can work on what they intended to do most (say 80%) of the time now, spending say a day's worth a week (20%) on bugs, instead of the reverse, 80% (4 days a week) on bugs and if they're lucky, a day a week on what they were supposed to be focused on, which is what we were seeing for awhile. Plus the tools to do the debugging, etc, are far more mature now, another reason bugs should hopefully take less time now.
Re: unsolvable technical issues?
Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as excerpted: > On 2018-06-24 16:22, Goffredo Baroncelli wrote: >> On 06/23/2018 07:11 AM, Duncan wrote: >>> waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted: >>> >>>> According to this: >>>> >>>> https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 , >>>> section 1.2 >>>> >>>> It claims that BTRFS still have significant technical issues that may >>>> never be resolved. >>> >>> I can speculate a bit. >>> >>> 1) When I see btrfs "technical issue that may never be resolved", the >>> #1 first thing I think of, that AFAIK there are _definitely_ no plans >>> to resolve, because it's very deeply woven into the btrfs core by now, >>> is... >>> >>> [1)] Filesystem UUID Identification. Btrfs takes the UU bit of >>> Universally Unique quite literally, assuming they really *are* >>> unique, at least on that system[.] Because >>> btrfs uses this supposedly unique ID to ID devices that belong to the >>> filesystem, it can get *very* mixed up, with results possibly >>> including dataloss, if it sees devices that don't actually belong to a >>> filesystem with the same UUID as a mounted filesystem. >> >> As partial workaround you can disable udev btrfs rules and then do a >> "btrfs dev scan" manually only for the device which you need. > You don't even need `btrfs dev scan` if you just specify the exact set > of devices in the mount options. The `device=` mount option tells the > kernel to check that device during the mount process. Not that lvm does any better in this regard[1], but has btrfs ever solved the bug where only one device= in the kernel commandline's rootflags= would take effect, effectively forcing initr* on people (like me) who would otherwise not need them and prefer to do without them, if they're using a multi-device btrfs as root? 
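Austin's point about listing devices explicitly can be illustrated concretely. These are sketches with made-up device names and a placeholder UUID, shown mostly as comments because they are configuration, not commands to run as-is:

```shell
# /etc/fstab entry naming every member of a two-device btrfs explicitly,
# so the kernel checks both at mount time without relying on a prior scan:
#   UUID=<fs-uuid>  /data  btrfs  device=/dev/sdb1,device=/dev/sdc1,noatime  0 0

# The same idea on the kernel command line for a multi-device root:
#   root=/dev/sdb1 rootflags=device=/dev/sdb1,device=/dev/sdc1
# but, per the bug discussed above, historically only one device= in
# rootflags took effect, which is what forces an initramfs on people
# running multi-device btrfs as root.
```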
Not to mention the fact that as kernel people will tell you, device enumeration isn't guaranteed to be in the same order every boot, so device=/dev/* can't be relied upon and shouldn't be used -- but of course device=LABEL= and device=UUID= and similar won't work without userspace, basically udev (if they work at all, IDK if they actually do). Tho in practice from what I've seen, device enumeration order tends to be dependable /enough/ for at least those without enterprise-level numbers of devices to enumerate. True, it /does/ change from time to time with a new kernel, but anybody sane keeps a tested-dependable old kernel around to boot to until they know the new one works as expected, and that sort of change is seldom enough that users can boot to the old kernel and adjust their settings for the new one as necessary when it does happen. So as "don't do it that way because it's not reliable" as it might indeed be in theory, in practice, just using an ordered /dev/* in kernel commandlines does tend to "just work"... provided one is ready for the occasion when that device parameter might need a bit of adjustment, of course. > Also, while LVM does have 'issues' with cloned PV's, it fails safe (by > refusing to work on VG's that have duplicate PV's), while BTRFS fails > very unsafely (by randomly corrupting data). And IMO that "failing unsafe" is both serious and common enough that it easily justifies adding the point to a list of this sort, thus my putting it #1. >>> 2) Subvolume and (more technically) reflink-aware defrag. >>> >>> It was there for a couple kernel versions some time ago, but >>> "impossibly" slow, so it was disabled until such time as btrfs could >>> be made to scale rather better in this regard. > I still contend that the biggest issue WRT reflink-aware defrag was that > it was not optional. The only way to get the old defrag behavior was to > boot a kernel that didn't have reflink-aware defrag support. 
IOW, > _everyone_ had to deal with the performance issues, not just the people > who wanted to use reflink-aware defrag. Absolutely. Which of course suggests making it optional, with a suitable warning as to the speed implications with lots of snapshots/reflinks, when it does get enabled again (and as David mentions elsewhere, there's apparently some work going into the idea once again, which potentially moves it from the 3-5 year range, at best, back to a 1/2-2-year range, time will tell). >>> 3) N-way-mirroring. >>> >> [...] >> This is not an issue, but a not implemented feature > If you're looking
Re: unsolvable technical issues?
, since it'll use some of that code", since at least 3.5, when raid56 was supposed to be introduced in 3.6. I know because this is the one I've been most looking forward to personally, tho my original reason, aging but still usable devices that I wanted extra redundancy for, has long since itself been aged out of rotation. Of course we know the raid56 story and thus the implied delay here, if it's even still roadmapped at all now, and as with reflink-aware-defrag, there's no hint yet as to when we'll actually see this at all, let alone see it in a reasonably stable form, so at least in the practical sense, it's arguably "might never be resolved." 4) (Until relatively recently, and still in terms of scaling) Quotas. Until relatively recently, quotas could arguably be added to the list. They were rewritten multiple times, and until recently, appeared to be effectively eternally broken. While that has happily changed recently and (based on the list, I don't use 'em personally) quotas actually seem at least somewhat usable these days (altho less critical bugs are still being fixed), AFAIK quota scalability while doing btrfs maintenance remains a serious enough issue that the recommendation is to turn them off before doing balances, and the same would almost certainly apply to reflink-aware-defrag (turn quotas off before defragging) were it available, as well. That scalability alone could arguably be a "technical issue that may never be resolved", and while quotas themselves appear to be reasonably functional now, that could arguably justify them still being on the list. And of course that's avoiding the two you mentioned, tho arguably they could go on the "may in practice never be resolved, at least not in the non-bluesky lifetime" list as well. As for stratis, supposedly they're deliberately taking existing proven in multi-layer-form technology and simply exposing it in unified form.
They claim this dramatically lessens the required new code and shortens time-to-stability to something reasonable, in contrast to the about a decade btrfs has taken already, without yet reaching a full feature set and full stability. IMO they may well have a point, tho AFAIK they're still new and immature themselves and (I believe) don't have it either, so it's a point that AFAIK has yet to be fully demonstrated. We'll see how they evolve. I do actually expect them to move faster than btrfs, but also expect the interface may not be as smooth and unified as they'd like to present, as I expect there to remain some hiccups in smoothing over the layering issues. Also, because they've deliberately chosen to go with existing technology where possible in order to evolve to stability faster, by the same token they're deliberately limiting the evolution to incremental over existing technology, and I expect there's some stuff btrfs will do better as a result... at least until btrfs (or a successor) becomes stable enough for them to integrate (parts of?) it as existing demonstrated-stable technology. The other difference, AFAIK, is that stratis is specifically a corporation making it a/the main money product, whereas btrfs was always something the btrfs devs used at their employers (oracle, facebook), who have other things as their main product. As such, stratis is much more likely to prioritize things like raid status monitors, hot-spares, etc, that can be part of the product they sell, where they've been lower priority for btrfs.
Re: RAID56
Gandalf Corvotempesta posted on Wed, 20 Jun 2018 11:15:03 +0200 as excerpted: > Il giorno mer 20 giu 2018 alle ore 10:34 Duncan <1i5t5.dun...@cox.net> > ha scritto: >> Parity-raid is certainly nice, but mandatory, especially when there's >> already other parity solutions (both hardware and software) available >> that btrfs can be run on top of, should a parity-raid solution be >> /that/ necessary? > > You can't be serious. hw raid has much more flaws than any sw raid. I didn't say /good/ solutions, I said /other/ solutions. FWIW, I'd go for mdraid at the lower level, were I to choose, here. But for a 4-12-ish device solution, I'd probably go btrfs raid1 on a pair of mdraid-0s. That gets you btrfs raid1 data integrity and recovery from its other mirror, while also being faster than the still not optimized btrfs raid10. Beyond about a dozen devices, six per "side" of the btrfs raid1, the risk of multi-device breakdown before recovery starts to get too high for comfort, but six 8 TB devices in raid0 gives you up to 48 TB to work with, and more than that arguably should be broken down into smaller blocks to work with in any case, because otherwise you're simply dealing with so much data it'll take you unreasonably long to do much of anything non-incremental with it, from any sort of fscks or btrfs maintenance, to trying to copy or move the data anywhere (including for backup/restore purposes), to... whatever. Actually, I'd argue that point is reached well before 48 TB, but the point remains, at some point it's just too much data to do much of anything with, too much to risk losing all at once, too much to backup and restore all at once as it just takes too much time to do it, just too much... And that point's well within ordinary raid sizes with a dozen devices or less, mirrored, these days. Which is one of the reasons I'm so skeptical about parity-raid being mandatory "nowadays".
Maybe it was in the past, when disks were (say) half a TB or less and mirroring a few TB of data was resource-prohibitive, but now? Of course we've got a guy here who works with CERN and deals with their annual 50ish petabytes of data (49 in 2016, see wikipedia's CERN article), but that's simply problems on a different scale. Even so, I'd say it needs broken up into manageable chunks, and 50 PB is "only" a bit over 1000 48-TB filesystems worth. OK, say 2000, so you're not filling them all absolutely full. Meanwhile, I'm actually an N-way-mirroring proponent, here, as opposed to a parity-raid proponent. And at that sort of scale, you /really/ don't want to have to restore from backups, so 3-way or even 4-5 way mirroring makes a lot of sense. Hmm... 2.5 dozen for 5-way-mirroring, 2000 times, 2.5*12*2000=... 60K devices! That's a lot of hard drives! And a lot of power to spin them. But I guess it's a rounding error compared to what CERN uses for the LHC. FWIW, N-way-mirroring has been on the btrfs roadmap, since at least kernel 3.6, for "after raid56". I've been waiting awhile too; no sign of it yet so I guess I'll be waiting awhile longer. So as they say, "welcome to the club!" I'm 51 now. Maybe I'll see it before I die. Imagine, I'm in my 80s in the retirement home and get the news btrfs finally has N-way-mirroring in mainline. I'll be jumping up and down and cause a ruckus when I break my hip! Well, hoping it won't be /that/ long, but... =;^]
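The "btrfs raid1 on a pair of mdraid-0s" layout suggested above can be sketched as follows. Device names are hypothetical placeholders for four equal-size disks; do not run this against disks holding data:

```shell
# Hypothetical devices: four disks paired into two md raid0 stripes,
# with btrfs raid1 mirroring one copy onto each stripe.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd

mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
mount /dev/md0 /mnt/big
```

The design point: btrfs keeps one checksummed copy per md array, so a checksum failure on one side can be repaired from the other mirror, while each side still gets raid0 striping throughput -- the combination Duncan argues beats the then-unoptimized native btrfs raid10.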
Re: btrfs balance did not progress after 12H
Austin S. Hemmelgarn posted on Tue, 19 Jun 2018 12:58:44 -0400 as excerpted: > That said, I would question the value of repacking chunks that are > already more than half full. Anything above a 50% usage filter > generally takes a long time, and has limited value in most cases (higher > values are less likely to reduce the total number of allocated chunks). > With `-dusage=50` or less, you're guaranteed to reduce the number of > chunks if at least two match, and it isn't very time consuming for the > allocator, all because you can pack at least two matching chunks into > one 'new' chunk (new in quotes because it may re-pack them into existing > slack space on the FS). Additionally, `-dusage=50` is usually sufficient > to mitigate the typical ENOSPC issues that regular balancing is supposed > to help with. While I used to agree, 50% for best efficiency, perhaps 66 or 70% if you're really pressed for space, now that the allocator can repack into existing chunks more efficiently than it used to (at least in ssd mode, which all my storage is now), I've seen higher values result in practical/noticeable recovery of space to unallocated as well. In fact, I routinely use usage=70 these days, and sometimes use higher, to 99 or even 100%[1]. But of course I'm on ssd so it's far faster, and partition it up with the biggest partitions being under 100 GiB, so even full unfiltered balances are normally under 10 minutes and normal filtered balances under a minute, to the point I usually issue the balance command and actually wait for completion, so it's a far different ball game than issuing a balance command on a multi-TB hard drive and expecting it to take hours or even days. In that case, yeah, a 50% cap arguably makes sense, tho he was using 60, which still shouldn't (sans bugs like we seem to have here) be /too/ bad.
--- [1] usage=100: -musage=1..100 is the only way I've found to balance metadata without rebalancing system as well, with the unfortunate penalty for rebalancing system on small filesystems being an increase of the system chunk size from the 8 MB original mkfs.btrfs size to 32 MB... with only a few KiB used! =:^(
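Austin's "two chunks at or below 50% always pack into one" guarantee can be checked with a little arithmetic. The following is a toy model for illustration only -- chunk sizes are normalized to 100 units and real btrfs allocation is messier -- but it shows why low usage filters guarantee freed chunks while high ones may rewrite a lot of data for nothing:

```shell
# Toy model of `btrfs balance start -dusage=N`: chunks whose usage
# percentage is <= N are rewritten, and their live data is packed into
# as few chunks as possible.  Prints the number of chunks freed.
freed_by_balance() {                 # usage: freed_by_balance FILTER pct...
  local filter=$1; shift
  local selected=0 live=0 u
  for u in "$@"; do
    if [ "$u" -le "$filter" ]; then
      selected=$((selected + 1))     # chunk matches the usage filter
      live=$((live + u))             # accumulate its live data
    fi
  done
  local repacked=$(( (live + 99) / 100 ))   # ceiling(live / 100 units)
  echo $(( selected - repacked ))
}

freed_by_balance 50 40 40 90   # two half-empty chunks repack into one: prints 1
freed_by_balance 70 70 70 70   # 70% chunks don't pack pairwise: prints 0
```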
Re: RAID56
on that I'm not sure has been settled yet. > Based on official BTRFS status page, RAID56 is the only "unstable" item > marked in red. > No interest from Suse in fixing that? As the above should make clear, it's _not_ a question as simple as "interest"! > I think it's the real missing part for a feature-complete filesystem. > Nowadays parity raid is mandatory, we can't only rely on mirroring. "Nowadays"? "Mandatory"? Parity-raid is certainly nice, but mandatory, especially when there's already other parity solutions (both hardware and software) available that btrfs can be run on top of, should a parity-raid solution be /that/ necessary? Of course btrfs isn't the only next-gen fs out there, either; there's other solutions such as zfs available too, if btrfs doesn't have the features required at the maturity required. So I'd like to see the supporting argument to parity-raid being mandatory for btrfs, first, before I'll take it as a given. Nice, sure. Mandatory? Call me skeptical. --- [1] "Still cautious" use: In addition to the raid56-specific reliability issues described above, as well as to cover Waxhead's referral to my usual backups advice: Sysadmin's[2] first rule of data value and backups: The real value of your data is not defined by any arbitrary claims, but rather by how many backups you consider it worth having of that data. No backups simply defines the data as of such trivial value that it's worth less than the time/trouble/resources necessary to do and have at least one level of backup. With such a definition, data loss can never be a big deal, because even in the event of data loss, what was defined as of most importance, the time/trouble/resources necessary to have a backup (or at least one more level of backup, in the event there were backups but they failed too), was saved.
So regardless of whether the data was recoverable or not, you *ALWAYS* save what you defined as most important, either the data if you had a backup to retrieve it from, or the time/trouble/resources necessary to make that backup, if you didn't have it because saving that time/trouble/resources was considered more important than making that backup. Of course the sysadmin's second rule of backups is that it's not a backup, merely a potential backup, until you've tested that you can actually recover the data from it in similar conditions to those under which you'd need to recover it. IOW, boot to the backup or to the recovery environment, and be sure the backup's actually readable and can be recovered from using only the resources available in the recovery environment, then reboot back to the normal or recovered environment and be sure that what you recovered from the recovery environment is actually bootable or readable in the normal environment. Once that's done, THEN it can be considered a real backup. "Still cautious use" is simply ensuring that you're following the above rules, as any good admin will be regardless, and that those backups are actually available and recoverable in a timely manner should that be necessary. IOW, an only backup "to the cloud" that's going to take a week to download and recover to, isn't "still cautious use", if you can only afford a few hours down time. Unfortunately, that's a real-life scenario I've seen people say they're in here more than once. [2] Sysadmin: As used here, "sysadmin" simply refers to the person who has the choice of btrfs, as compared to say ext4, in the first place, that is, the literal admin of at least one system, regardless of whether that's administering just their own single personal system, or thousands of systems across dozens of locations in some large corporation or government institution.
[3] Raid56 mode reliability implications: For raid56 data, this isn't /that/ big of a deal, tho depending on what's in the rest of the stripe, it could still affect files not otherwise written in some time. For metadata, however, it's a huge deal, since an incorrectly reconstructed metadata stripe could take out much or all of the filesystem, depending on what metadata was actually in that stripe. This is where waxhead's recommendation to use raid1/10 for metadata even if using raid56 for data comes in.
Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches
Marc Lehmann posted on Wed, 06 Jun 2018 21:06:35 +0200 as excerpted:

> Not sure what exactly you mean with btrfs mirroring (there are many
> btrfs features this could refer to), but the closest thing to that that
> I use is dup for metadata (which is always checksummed), data is always
> single. All btrfs filesystems are on lvm (not mirrored), and most (but
> not all) are encrypted. One affected fs is on a hardware raid
> controller, one is on an ssd. I have a single btrfs fs in that box with
> raid1 for metadata, as an experiment, but I haven't used it for testing
> yet.

On the off chance, tho it doesn't sound like it from your description... You're not doing LVM snapshots of the volumes with btrfs on them, correct?

Btrfs depends on filesystem GUIDs being just that, globally unique, using them to find the possible multiple devices of a multi-device btrfs. (Normal single-device filesystems don't have the issue, as they never have to deal with multi-device the way btrfs does.) Btrfs can therefore get very confused, with data-loss potential, if it sees multiple copies of a device with the same filesystem GUID. That can happen if LVM snapshots (which obviously have the same filesystem GUID as the original) are taken and both the snapshot and the source are exposed to btrfs device scan (auto-triggered by udev when the new device appears), with one of them mounted.

Presumably you'd consider lvm snapshotting a form of mirroring, and you've already said you're not doing that in any form. But just in case, because this is a rather obscure trap people using lvm could find themselves in, without a clue as to the danger, and the resulting symptoms could be rather hard to troubleshoot if this possibility wasn't considered.
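A minimal sketch of spotting that trap before mounting anything: an LVM snapshot of a btrfs LV shares the origin's filesystem UUID, and duplicate UUIDs are exactly what confuses btrfs device scan. The UUIDs and LV names below are made up; on a live system the same pipeline would be fed from `lsblk -rno UUID,NAME` (run as root) instead of the sample text.

```shell
# Hypothetical `lsblk -rno UUID,NAME` output: vg-root-snap is an LVM
# snapshot of vg-root, so both carry the same filesystem UUID.
sample='aaaa-1111 vg-root
aaaa-1111 vg-root-snap
bbbb-2222 vg-home'

# Print any UUID seen more than once -- those are the dangerous ones.
echo "$sample" | awk 'seen[$1]++ { print "duplicate filesystem UUID:", $1 }'
# -> duplicate filesystem UUID: aaaa-1111
```

If that prints anything, hide the snapshot from udev/btrfs (or remove it) before mounting the origin.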
Re: RAID-1 refuses to balance large drive
Brad Templeton posted on Sun, 27 May 2018 11:22:07 -0700 as excerpted:

> BTW, I decided to follow the original double replace strategy suggested --
> replace 6TB with 8TB and replace 4TB with 6TB. That should be sure to
> leave the 2 large drives each with 2TB free once expanded, and thus able
> to fully use all space.
>
> However, the first one has been going for 9 hours and is "189.7% done"
> and still going. Some sort of bug in calculating the completion
> status, obviously. With luck 200% will be enough?

IIRC there was an over-100% completion-status bug fixed, I'd guess about 18 months to two years ago now -- long enough that it would have slipped regulars' minds, so nobody would have thought of it even knowing you're still on 4.4. That's one of the reasons we don't do as well supporting stuff that old. If it is indeed the same bug, anything even half modern should have it fixed.
Re: RAID-1 refuses to balance large drive
In terms of what we try to support, it's the last two kernel release series in each of the current and LTS tracks. So with 4.16 current, 4.15, tho EOLed upstream, is still reasonably supported for the moment here, tho people should be upgrading to 4.16 by now, as 4.17 should be out in a couple weeks or so and 4.15 will fall out of the two-current-kernel-series window at that point. Meanwhile, the two latest LTS series are, as already stated, 4.14 and the earlier 4.9.

4.4 is the one previous to that. It's still mainline supported in general, but it's out of the two-LTS-series window of best support here, and truth be told, based on history, even supporting the second-newest LTS series starts to get more difficult at about a year and a half out, six months or so before the next LTS comes out. As it happens that's about where 4.9 is now, and 4.14 has had about six months to stabilize, so for LTS I'd definitely recommend 4.14, now.

Of course that doesn't mean we /refuse/ to support 4.4; we still try. But it's out of primary focus now, and in many cases, should you have problems, the first recommendation is going to be to try something newer and see if the problem goes away or presents differently. Or, as mentioned, check with your distro if it's a distro kernel, since in that case they're best positioned to support it.
Re: csum failed root raveled during balance
ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:

>> IMHO the best course of action would be to disable checksumming for you
>> vm files.
>
> Do you mean '-o nodatasum' mount flag? Is it possible to disable
> checksumming for singe file by setting some magical chattr? Google
> thinks it's not possible to disable csums for a single file.

You can use nocow (chattr +C), but of course that has other restrictions as well as the usual nocow effects. The attribute only takes effect on zero-length files, which for existing data is easiest handled by setting it on the containing dir and copying the files in (without reflinks). And nocow becomes cow1 after a snapshot: the snapshot locks the existing copy in place, so the first change to a block after the snapshot /must/ be written elsewhere -- cow the first time written after the snapshot, but retaining nocow for repeated writes between snapshots.

But if you're disabling checksumming anyway, nocow's likely the way to go.
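A minimal sketch of that "set it on the dir, copy files in without reflinking" workflow, using a throwaway temp directory rather than a real VM-image path. The chattr is guarded because only btrfs (and a few other filesystems) support the C attribute; GNU coreutils `cp` is assumed for `--reflink=never`.

```shell
# Hypothetical nocow setup for VM images; dir is a scratch directory
# standing in for a directory on a btrfs mount.
dir=$(mktemp -d)

# New files created in the dir will inherit nocow (no-op elsewhere).
chattr +C "$dir" 2>/dev/null || echo "nocow attribute unsupported on this filesystem"

# Existing data must be *copied* in without reflinks so the new file
# is created zero-length and grows with the attribute in effect.
printf 'fake image data' > "$dir/disk.img.orig"
cp --reflink=never "$dir/disk.img.orig" "$dir/disk.img"

lsattr -d "$dir" 2>/dev/null   # on btrfs, the 'C' flag appears here
cmp -s "$dir/disk.img.orig" "$dir/disk.img" && echo "copied intact"

rm -r "$dir"
```

Note that nodatasum is implied by nocow, which is the point here: no csums, so no csum failures during balance.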
Re: [PATCH] btrfs: property: Set incompat flag of lzo/zstd compression
Su Yue posted on Tue, 15 May 2018 16:05:01 +0800 as excerpted:

> On 05/15/2018 03:51 PM, Misono Tomohiro wrote:
>> Incompat flag of lzo/zstd compression should be set at:
>> 1. mount time (-o compress/compress-force)
>> 2. when defrag is done
>> 3. when property is set
>>
>> Currently 3. is missing and this commit adds this.
>
> If I don't misunderstand, compression property of an inode is only apply
> for *the* inode, not the whole filesystem.
> So the original logical should be okay.

But the inode is on the filesystem, and if it's compressed with lzo/zstd, the incompat flag should be set, to avoid mounting with an earlier kernel that doesn't understand that compression and would therefore, if we're lucky, simply fail to read the data compressed in that file/inode. (If we're unlucky, it could blow up with kernel memory corruption, like James Harvey's current case of unexpected, corrupted compressed data in a nocow file, which, being nocow, doesn't have csum validation to fail and abort the decompression, and shouldn't be compressed at all.)

So better to set the incompat flag and refuse to mount at all on kernels that don't have the required compression support.
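For reference, the per-inode property being discussed is the one set from userspace via btrfs-progs; a sketch of the commands (the mount point and file are hypothetical, and these need a btrfs filesystem and appropriate privileges, so they're shown as a transcript rather than run):

```shell
# Set per-file zstd compression via the property interface -- the case
# (3.) the patch above covers, which should also flip the filesystem's
# zstd incompat flag:
#   btrfs property set /mnt/data/big.log compression zstd
#   btrfs property get /mnt/data/big.log compression
# With the flag set, a pre-zstd kernel refuses the mount outright
# instead of failing (or worse) when it hits the compressed extents.
```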
Re: [PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted
Darrick J. Wong posted on Fri, 11 May 2018 17:06:34 -0700 as excerpted:

> On Fri, May 11, 2018 at 12:26:51PM -0700, Mark Fasheh wrote:
>> Right now we return EINVAL if a process does not have permission to dedupe a
>> file. This was an oversight on my part. EPERM gives a true description of
>> the nature of our error, and EINVAL is already used for the case that the
>> filesystem does not support dedupe.
>>
>> Signed-off-by: Mark Fasheh
>> ---
>> fs/read_write.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 77986a2e2a3b..8edef43a182c 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -2038,7 +2038,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>> info->status = -EINVAL;
>> } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
>> uid_eq(current_fsuid(), dst->i_uid))) {
>> -info->status = -EINVAL;
>> +info->status = -EPERM;
>
> Hmm, are we allowed to change this aspect of the kabi after the fact?
>
> Granted, we're only trading one error code for another, but will the
> existing users of this care? xfs_io won't and I assume duperemove won't
> either, but what about bees? :)

From the 0/2 cover-letter:

>>> This has also popped up in duperemove, mostly in the form of cryptic
>>> error messages. Because this is a code returned to userspace, I did
>>> check the other users of extent-same that I could find. Both 'bees'
>>> and 'rust-btrfs' do the same as duperemove and simply report the error
>>> (as they should).

> --D
>
>> } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>> info->status = -EXDEV;
>> } else if (S_ISDIR(dst->i_mode)) {
>> --
>> 2.15.1
Re: RAID56 - 6 parity raid
Goffredo Baroncelli posted on Wed, 02 May 2018 22:40:27 +0200 as excerpted:

> Anyway, my "rant" started when Ducan put near the missing of parity
> checksum and the write hole. The first might be a performance problem.
> Instead the write hole could lead to a loosing data. My intention was to
> highlight that the parity-checksum is not related to the reliability and
> safety of raid5/6.

Thanks for making that point... and to everyone else for the vigorous thread debating it, as I'm learning quite a lot! =:^)

From your first reply:

>> Why the fact that the parity is not checksummed is a problem ?
>> I read several times that this is a problem. However each time the
>> thread reached the conclusion that... it is not a problem.

I must have missed those threads, or at least missed that conclusion from them (maybe believing they were about something rather narrower, or conflating... for instance), because AFAICT this is the first time I've seen the practical merits of checksummed parity actually debated, at least in terms I as a non-dev can reasonably understand. To my mind it was settled (or I'd have worded my original claim rather differently) and only now am I learning different. And... to my credit... given the healthy vigor of the debate, it seems I'm not the only one that missed them...

But I'm surely learning of it now, and indeed, I had somewhat conflated parity-checksumming with the in-place-stripe-read-modify-write atomicity issue. I'll leave the parity-checksumming debate (now that I know it at least remains debatable) to those more knowledgeable than myself, but in addition to what I've learned of it, I've definitely learned that I can't properly conflate it with the in-place stripe-rmw atomicity issue, so thanks!
Re: RAID56 - 6 parity raid
Gandalf Corvotempesta posted on Wed, 02 May 2018 19:25:41 +0000 as excerpted:

> On 05/02/2018 03:47 AM, Duncan wrote:
>> Meanwhile, have you looked at zfs? Perhaps they have something like
>> that?
>
> Yes, i've looked at ZFS and I'm using it on some servers but I don't
> like it too much for multiple reasons, in example:
>
> 1) is not officially in kernel, we have to build a module every time
> with DKMS

FWIW zfs is excluded from my choice domain as well, due to the well-known license issues. Regardless of the strict legal implications, Oracle holds the copyrights and could easily solve that problem; the fact that they haven't strongly suggests they have no interest in doing so. That in turn means they have no interest in people like me running zfs, which means I have no interest in it either.

But because it does remain effectively the nearest "working now" solution to btrfs features and potential features, for those who simply _must_ have it and/or find it a more acceptable solution than cobbling together a multi-layer solution out of a standard filesystem on top of device-mapper or whatever, it's what I and others point to when people wonder about missing or unstable btrfs features.

> I'm new to BTRFS (if fact, i'm not using it) and I've seen in the status
> page that "it's almost ready".
> The only real missing part is a stable, secure and properly working
> RAID56,
> so i'm thinking why most effort aren't directed to fix RAID56 ?

Well, they are. But finding and fixing corner-case bugs takes time and early-adopter deployments, and btrfs doesn't have the engineering resources to simply assign to the problem that Sun had with zfs.

Despite that, as I stated, the current btrfs raid56 code is, to the best of my/list knowledge, now reasonably ready, tho it'll take another year or two without serious bug reports to actually demonstrate that. What remains is the well-known write hole that applies to all parity-raid, unless specific measures are taken such as partial-stripe-write logging (slow), writing a full stripe even if it's partially empty (wastes space and needs periodic maintenance to reclaim it), or variable stripe widths (needs periodic maintenance too, and more complex than always writing full stripes even if they're partially empty), the latter two avoiding the problem by avoiding the in-place read-modify-write cycle entirely.

So to a large degree what's left is simply time for testing to demonstrate stability on the one hand, and a well-known problem with parity-raid in general on the other. There's the small detail that said well-known write hole has additional implementation-detail implications on btrfs, but at its root it's the same problem all parity-raid has, and people choosing parity-raid as a solution are already choosing to either live with it or ameliorate it in some other way (tho some parity-raid solutions have that amelioration built in).

> There are some environments where a RAID1/10 is too expensive and a
> RAID6 is mandatory,
> but with the current state of RAID56, BTRFS can't be used for valuable
> data

Not entirely true. Btrfs, even btrfs raid56 mode, _can_ be used for "valuable" data; it simply requires astute /practical/ definitions of "valuable", as opposed to simple claims that don't actually stand up in practice. Here's what I mean:

The sysadmin's first rule of backups defines "valuable data" by the number of backups it's worth making of that data. If there's no backup, then by definition the data is worth less than the time/hassle/resources necessary to have that backup, because it's not a question of if, but rather when, something's going to go wrong with the working copy and it won't be available any longer.

Additional layers of backup, and whether one keeps geographically separated off-site backups as well, are simply extensions of the first-level-backup case/rule. The more valuable the data, the more backups it's worth having of it, and the more effort is justified in ensuring that single or even multiple disasters aren't going to leave no working backup.

With this view, it's perfectly fine to use btrfs raid56 mode for "valuable" data, because that data is backed up and that backup can be used as a fallback if necessary. True, the "working copy" might not be as reliable as it is in some cases, but statistically, that simply brings the 50% chance of failure (or whatever other percentage chance you choose) closer, to say once a year, or once a month, rather than perhaps once or twice a decade. Working copy failure is GOING to happen in any case; it's just a matter of playing the chance
Re: RAID56 - 6 parity raid
Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 +0000 as excerpted:

> Hi to all
> I've found some patches from Andrea Mazzoleni that adds support up to 6
> parity raid.
> Why these are wasn't merged ?
> With modern disk size, having something greater than 2 parity, would be
> great.

1) Btrfs parity-raid was known to be seriously broken until quite recently (and still has the common parity-raid write hole, which is more serious on btrfs because btrfs otherwise goes to some lengths to ensure data/metadata integrity via checksumming and verification, and the parity isn't checksummed, risking even old data due to the write hole, tho there are a number of proposals to fix that), and piling even more not-well-tested patches on top was _not_ the way toward a solution.

2) Btrfs features in general have taken longer to merge and stabilize than one might expect, and parity-raid has been a prime example, with the original roadmap calling for parity-raid merge back in the 3.5 timeframe or so... Partial/runtime (not full recovery) code was finally merged ~3 years later in (IIRC) 3.19. It took several development cycles for the initial critical bugs to be worked out, but by 4.2 or so it was starting to look good; then more bugs were found and reported, and those took several more years to fix, tho IIRC LTS-4.14 has the fixes.

Meanwhile, consider that N-way-mirroring was fast-path roadmapped for "right after raid56 mode" (because some of its code depends on that), so it was originally expected in 3.6 or so... As someone who had been wanting to use /that/, I personally know the pain of "still waiting". And that was "fast-pathed". So even if the multi-way-parity patches were on the "fast" path, it's only "now" (for relative values of now, for argument's sake say by 4.20/5.0 or whatever it ends up being called) that such a thing could be reasonably considered.

3) AFAIK none of the btrfs devs have flat rejected the idea, but btrfs remains development-opportunity rich and implementing-dev poor... there's likely 20 years or more of "good" ideas out there. And the N-way-parity-raid patches haven't hit any of the current devs' (or their employers') "personal itch that needs to be scratched" interest points. So while it certainly does remain a "nice idea", given the implementation timeline history for even "fast-pathed" ideas, realistically we're looking at at least a decade out. And with the practical projection horizon no more than 5-7 years out (beyond that, other, unpredicted developments are likely to change things so much that projection is effectively impossible), in practice a decade out is "bluesky", aka "it'd be nice to have someday, but it's not a priority, and with current developer manpower, it's unlikely to happen any time in the practically projectable future."

4) Of course all that's subject to no major new btrfs developer (or sponsor) making it a high priority. But even should such a developer (and/or sponsor) appear, they'd probably need to spend at least two years coming up to speed with the code first, fixing normal bugs and improving the existing code quality, then post the updated and rebased N-way-parity patches for discussion, and get them roadmapped for merge probably some years later due to other then-current project feature dependencies. So even if the N-way-parity patches became some new developer's (or sponsor's) personal itch to scratch, by the time they came up to speed and the code was actually merged, there's no realistic projection that it would be in under 5 years, plus another couple to stabilize, so at least 7 years to properly usable stability. Which puts us, even then, at the 5-7 years practical projectability limit.

Meanwhile, have you looked at zfs? Perhaps they have something like that?

And there's also a new(?) one, stratis, AFAIK commercially sponsored and device-mapper based, that I saw an article on recently, tho I've seen/heard no kernel-community discussion on it (there's a good chance followup here will change that if it's worth discussing, as there's several folks here for whom knowing about such things is part of their job) and no other articles (besides the pt 1 of the series mentioned below), so for all I know it's pie-in-the-sky, or still new enough that it'd be 5-7 years before it can be used in practice as well. But assuming it's a viable project, presumably it would get such support if device-mapper did/has it.

The stratis article I saw (apparently part 2 in a series):
https://opensource.com/article/18/4/stratis-lessons-learned
Re: NVMe SSD + compression - benchmarking
Brendan Hide posted on Sat, 28 Apr 2018 09:30:30 +0200 as excerpted:

> My real worry is that I'm currently reading at 2.79GB/s (see result
> above and below) without compression when my hardware *should* limit it
> to 2.0GB/s. This tells me either `sync` is not working or my benchmark
> method is flawed.

No answer, but a couple of additional questions/suggestions:

* Tarfile: Just to be sure, you're using an uncompressed tarfile, not a compressed tgz/tbz2/etc, correct?

* How do hdparm -t and -T compare? That's read-only and bypasses the filesystem, so it should at least give you something to compare the 2.79 GB/s to, both from-raw-device (-t) and cached/memory-only (-T). See the hdparm (8) manpage for the details.

* And of course try the compressed tarball too, since it should be easy enough, and should give you compressible vs. incompressible numbers for sanity checking.
Re: What is recommended level of btrfs-progs and kernel please
David C. Partridge posted on Sat, 28 Apr 2018 15:09:07 +0100 as excerpted:

> To what level of btrfs-progs do you recommend I should upgrade once my
> corrupt FS is fixed? What is the kernel pre-req for that?
>
> Would prefer not to build from source ... currently running Ubuntu
> 16.04LTS

The way it works is as follows:

In normal operation, the kernel does most of the work, with commands such as balance and scrub simply making the appropriate calls to the kernel to do the real work. So the kernel version is what's critical in normal operation. (IIRC, the receive side of btrfs send/receive is an exception; userspace does the work there, tho the kernel does it on the send side.)

This list is mainline and forward-looking development focused, so recommended kernels, the ones people here are most familiar with, tend to be relatively new. The two support tracks are current and LTS, and we try to support the latest two kernels of each. On the current kernel track, 4.16 is the latest, so the 4.16 and 4.15 series are currently supported. On the LTS track, 4.14 is the newest LTS series and is recommended, with 4.9 the previous one, still supported, tho as it gets older and memories of what was going on at the time fade, it gets harder to support.

That doesn't mean we don't try to help people with older kernels, but the truth is, the best answer may well be "try it with a newer kernel and see if the problem persists". Similarly for distro kernels, particularly older ones. We track mainline and in general[1] have little idea what patches specific distros may have backported... or not. With newer kernels there's not so much to backport, and hopefully none of their added patches actually interferes. But particularly outside the mainline LTS series kernels, and older than the second-newest LTS series kernel for the real LTS distros, the distros themselves are choosing what to backport and support, and are thus in a better position to support those kernels than we on this list will be.

But when something goes wrong and you need to use the debugging tools or btrfs check or restore, it's the btrfs userspace (btrfs-progs) that does the work, so it becomes the most critical when you have a problem you are trying to find/repair/restore-from.

So in normal operation, userspace isn't critical, and the biggest problem is simply keeping it current enough that the output remains comparable to current output. With btrfs userspace release numbering following that of the kernel, for operational use a good rule of thumb is to keep userspace updated to at least the version of the oldest supported LTS kernel series, as mentioned 4.9 at present, thus keeping it at least within approximately two years of current.

But once something goes wrong, the newest available userspace, or close to it, has the latest fixes, and generally provides the best chance at a fix with the least hassle or chance of further breakage. So there, basically something within the current track, above, thus currently at least a 4.15 if not a 4.16 userspace (btrfs-progs), is your best bet.

And often the easiest way to get that, if your distro doesn't make it directly available, is to make it a point to keep around the latest LiveRescue (often install/rescue combined) image of a distro such as Fedora or Arch that stays relatively current. That's often new enough, and if it's not, it at least gives you a way to get back online to fetch something newer after booting the rescue image, if you have to.

---
[1] In general: I think one regular btrfs dev works with SuSE, and one non-dev but well-practiced support-list regular is most familiar with Fedora, tho of course Fedora doesn't tend to be /too/ outdated.
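That rule of thumb (userspace at least as new as the oldest supported LTS kernel series) is mechanical enough to sketch. The installed version below is hypothetical; on a real system it would come from `btrfs --version`, and the 4.9 floor is the oldest supported LTS series at the time of the post.

```shell
# Hypothetical installed btrfs-progs version vs. the LTS floor.
progs="4.7.3"
floor="4.9"

# sort -V orders version strings; if the installed version sorts first
# (and isn't equal to the floor), it is older than the floor.
oldest=$(printf '%s\n%s\n' "$progs" "$floor" | sort -V | head -n1)
if [ "$oldest" = "$progs" ] && [ "$progs" != "$floor" ]; then
    echo "btrfs-progs $progs is older than the $floor LTS floor: upgrade advised"
fi
# -> btrfs-progs 4.7.3 is older than the 4.9 LTS floor: upgrade advised
```

For repair work, of course, the advice above still applies: reach for the newest progs you can boot, not merely the floor.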
Re: status page
with the least chance of introducing new bugs, so the testing and bugfixing cycle should be shorter as well. But ouch, that logged-write penalty, on top of the read-modify-write penalty that short stripe writes on parity-raid already incur, will really do a number on performance! But it /should/ finally fix the write-hole risk, and it'd be the fastest way to do it on top of existing code, with the least risk of additional bugs, because it's the least new code to write.

What I personally suspect will happen is this last solution in the shorter term, tho it'll still take some years to be written and tested to stability, with the possibility of someone undertaking a btrfs parity-raid-g2 project implementing the first/cleanest possibility in the longer term, say a decade out (which effectively means "whenever someone with the skills and motivation decides to try it" -- could be 5 years out if they start today and devote the time to it, could be 15 years out, or never, if nobody ever decides to do it).

I honestly don't see the intermediate possibilities as worth the trouble, as they'd take too long for not enough payback compared to the solutions at either end, but of course someone might just come along that likes and actually implements that angle instead. As always with FLOSS, the one actually doing the implementation is the one who decides (subject to maintainer veto, of course, and possible distro, and ultimately mainlining-of-the-de-facto-situation, override of the maintainer, as well).

A single-paragraph summary answer? Current raid56 status quo is semi-stable and, subject to testing over time, is likely to remain there for some time, with the known parity-raid write-hole caveat as the biggest issue. There's discussion of attempts to mitigate the write hole, but the final form such mitigation will take remains to be settled. The shortest-to-stability alternative, logged partial-stripe-writes, has serious performance negatives, but that might be acceptable given that parity-raid already has read-modify-write performance issues, so people don't choose it for write performance in any case. That'd be probably 3 years out to stability at the earliest. There's a cleaner alternative, but it'd be /much/ farther out, as it'd involve a pretty heavy rewrite along with the long testing and bugfix cycle that implies, so ~10 years out, if ever, for that. And there's a couple of intermediate alternatives as well, but unless something changes I don't really see them going anywhere.
Re: Btrfs progs release 4.16.1
David Sterba posted on Wed, 25 Apr 2018 13:02:34 +0200 as excerpted:

> On Wed, Apr 25, 2018 at 06:31:20AM +0000, Duncan wrote:
>> David Sterba posted on Tue, 24 Apr 2018 13:58:57 +0200 as excerpted:
>>
>>> btrfs-progs version 4.16.1 have been released. This is a bugfix
>>> release.
>>>
>>> Changes:
>>>
>>> * remove obsolete tools: btrfs-debug-tree, btrfs-zero-log,
>>> btrfs-show-super, btrfs-calc-size
>>
>> Cue the admin-side gripes about developer definitions of micro-upgrade
>> explicit "bugfix release" that allow disappearance of "obsolete tools".
>>
>> Arguably such removals can be expected in a "feature release", but
>> shouldn't surprise unsuspecting admins doing a micro-version upgrade
>> that's specifically billed as a "bugfix release".
>
> A major version release would be a better time for the removal, I agree
> and should have considered that.
>
> However, the tools have been obsoleted for a long time (since 2015 or
> 2016) so I wonder if the deprecation warnings have been ignored by the
> admins all the time.

Indeed, in practice anybody still using the stand-alone tools in a current version has been ignoring deprecation warnings for a while, and the difference between 4.16.1 and 4.17(.0) isn't likely to make much of a difference to them.

It's just that, from here anyway, if I did a big multi-version upgrade and saw tools go missing, I'd expect it; and if I did an upgrade from 4.16 to 4.17, I'd expect it and blame myself for not getting with the program sooner. But on an upgrade from 4.16 to 4.16.1, an explicit "bugfix release" no less, I'd be annoyed with upstream when they went missing, because such removals just aren't expected in so minor a release.

>> (Further support for btrfs being "still stabilizing, not yet fully
>> stable and mature." But development mode habits need to end
>> /sometime/, if stability is indeed a goal.)
>
> What happened here was a bad release management decision, a minor one in
> my oppinion but I hear your complaint and will keep that in mind for
> future releases.

That's all I was after. A mere trifle indeed in the filesystem context, where there's a real chance that bugs can eat data, but equally trivially held off for a .0 release. What's behind is done, but it can and should be used to inform the future, and I simply mentioned it here with the goal /of/ informing future release decisions. To the extent that it does so, my post accomplished its purpose. =:^)

Seems my way of saying that ended up coming across way more negative than intended. So I have some changes to make in the way I handle things in the future as well. =:^)
Re: Btrfs progs release 4.16.1
David Sterba posted on Tue, 24 Apr 2018 13:58:57 +0200 as excerpted:

> btrfs-progs version 4.16.1 have been released. This is a bugfix
> release.
>
> Changes:
>
> * remove obsolete tools: btrfs-debug-tree, btrfs-zero-log,
> btrfs-show-super, btrfs-calc-size

Cue the admin-side gripes about developer definitions of a micro-upgrade, explicit "bugfix release", that allow disappearance of "obsolete tools".

Arguably such removals can be expected in a "feature release", but they shouldn't surprise unsuspecting admins doing a micro-version upgrade that's specifically billed as a "bugfix release".

(Further support for btrfs being "still stabilizing, not yet fully stable and mature." But development mode habits need to end /sometime/, if stability is indeed a goal.)
Re: Recovery from full metadata with all device space consumed?
>> Metadata, RAID1: total=3.00GiB, used=2.50GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> All of the consumable space on the backing devices also seems to be
>> in use:
>>
>> # btrfs fi show /broken
>> Label: 'mon_data'  uuid: 85e52555-7d6d-4346-8b37-8278447eb590
>>     Total devices 4 FS bytes used 69.50GiB
>>     devid    1 size 931.51GiB used 931.51GiB path /dev/sda1
>>     devid    2 size 931.51GiB used 931.51GiB path /dev/sdb1
>>     devid    3 size 931.51GiB used 931.51GiB path /dev/sdc1
>>     devid    4 size 931.51GiB used 931.51GiB path /dev/sdd1

As you suggest, all space on all devices is used. While fi usage breaks out unallocated as its own line-item, both per device and overall, with fi show/df you have to derive it from the difference between size and used on each device listed in the fi show report. If (after getting it that way with balance) you keep fi show per-device used under say 250 or 500 MiB, that'll go to unallocated, as fi usage will make clearer.

Meanwhile, for fi df, that data line says 3.6+ TiB of total data chunk allocations, but only 67 GiB used. As I said, that's ***WAY*** out of whack. Getting it back into something a bit more normal and keeping it there -- for under 100 GiB actually used, say under 250 or 500 GiB total, with the rest returned to unallocated, dropping the total in the fi df report and increasing unallocated in fi usage -- should keep you well out of trouble.
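To make the size-minus-used derivation concrete, here's a throwaway sketch (mine, not from the original report) that parses the devid lines quoted above; on a live system you'd feed it the output of `btrfs filesystem show /broken` instead of the pasted sample:

```shell
# Hedged sketch: compute per-device unallocated space (size - used) from
# `btrfs fi show` device lines. The figures below are the sample numbers
# from this thread, not live output.
report='devid    1 size 931.51GiB used 931.51GiB path /dev/sda1
devid    2 size 931.51GiB used 931.51GiB path /dev/sdb1'

printf '%s\n' "$report" | awk '/devid/ {
    size = $4; used = $6
    sub(/GiB/, "", size); sub(/GiB/, "", used)
    # This difference is what fi usage reports directly as "unallocated"
    printf "%s unallocated: %.2f GiB\n", $8, size - used
}'
```

Both devices come out at 0.00 GiB unallocated, which is exactly the fully-allocated condition described above.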
As for fi usage: while I use a bunch of much smaller filesystems here, all raid1 or dup, so it'll be of limited direct help, I'll post the output from one of mine, just so you can see how much easier it is to read the fi usage report:

$ sudo btrfs filesystem usage /
Overall:
    Device size:                  16.00GiB
    Device allocated:              7.02GiB
    Device unallocated:            8.98GiB
    Device missing:                  0.00B
    Used:                          4.90GiB
    Free (estimated):              5.25GiB    (min: 5.25GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:               16.00MiB    (used: 0.00B)

Data,RAID1: Size:3.00GiB, Used:2.24GiB
   /dev/sda5       3.00GiB
   /dev/sdb5       3.00GiB

Metadata,RAID1: Size:512.00MiB, Used:209.59MiB
   /dev/sda5     512.00MiB
   /dev/sdb5     512.00MiB

System,RAID1: Size:8.00MiB, Used:16.00KiB
   /dev/sda5       8.00MiB
   /dev/sdb5       8.00MiB

Unallocated:
   /dev/sda5       4.49GiB
   /dev/sdb5       4.49GiB

(FWIW there's also btrfs device usage, if you want a device-focused report.)

This is a btrfs raid1, both data and metadata, on a pair of 8 GiB devices, thus 16 GiB total. Of that 8 GiB per device, a very healthy 4.49 GiB per device, over half the filesystem, remains entirely chunk-level unallocated, and thus free to allocate to data or metadata chunks as needed.

Meanwhile, data chunk allocation is 3 GiB total per device, of which 2.24 GiB is used. Again, that's healthy, as data chunks are nominally 1 GiB, so that's probably three 1 GiB chunks allocated, with 2.24 GiB of it used.

By contrast, your in-trouble fi usage report will show (near) 0 unallocated and a ***HUGE*** gap between size/total and used for data. You should easily be able to get per-device data totals down to say 250 GiB or so (or down to 10 GiB or so with more work), with it all switching to unallocated, and then keep it healthy by doing a balance with -dusage= as necessary any time the numbers start getting out of line again.

-- Duncan - List replies preferred. No HTML msgs.
Re: remounted ro during operation, unmountable since
Qu Wenruo posted on Sat, 14 Apr 2018 22:41:50 +0800 as excerpted:

>> sectorsize 4096
>> nodesize 4096
>
> Nodesize is not the default 16K, any reason for this?
> (Maybe performance?)
>
>>> 3) Extra hardware info about your sda
>>> Things like SMART and hardware model would also help here.
>
>> Model Family: Samsung based SSDs
>> Device Model: SAMSUNG SSD 830 Series
>
> At least I haven't heard much problem about Samsung SSDs, so I don't
> think it's the hardware to blame. (Unlike Intel 600P)

The 830 model is a few years old, IIRC (I have 850s, and I think I saw 860s mentioned in something I read, probably on this list, but am not sure of it).

I suspect the filesystem was created with an old enough btrfs-tools that the default nodesize was still 4K, either due to an older distro, or simply due to using the filesystem that long.

-- Duncan - List replies preferred. No HTML msgs.
Re: btrfs fails to mount after power outage
Qu Wenruo posted on Thu, 12 Apr 2018 07:25:15 +0800 as excerpted:

> On 2018年04月11日 23:33, Tom Vincent wrote:
>> My btrfs laptop had a power outage and failed to boot with "parent
>> transid verify failed..." errors. (I have backups).
>
> Metadata corruption, again.
>
> I'm curious about what's the underlying disk?
> Is it plain physical device? Or have other layers like bcache/lvm?
>
> And what's the physical device? SSD or HDD?

The last line of his message said progs 4.15, kernel 4.15.15, NVMe, so it's SSD.

Another important question, tho, if not for this instance, then for easiest repair the next time something goes wrong: What mount options? In particular, is the discard option used (and of course I'm assuming nothing as insane as nobarrier)? Because as came up on a recent thread here...

Btrfs normally keeps a few generations of root blocks around, and one method of recovery is using the usebackuproot (or the deprecated recovery) option to try to use them if the current root is bad. But apparently nobody considered how discard and the backup roots would interact, and there's (currently) nothing keeping them from being marked for discard just as soon as the next new root becomes current.

Now some device firmware batches up discards as garbage-collection that can be done periodically, when the number of unwritten erase-blocks gets low, but others do discards basically immediately, meaning those backup roots are lost effectively immediately, making the usebackuproot recovery feature worthless. =:^(

Not a tradeoff that would occur to most people considering whether to enable discard or not, obviously including the btrfs devs that set up btrfs discard behavior. =:^( But it's definitely a tradeoff to consider once you /do/ know it!

Presumably that'll be fixed at some point, but not being a dev nor knowing how complex the fix might be, I won't venture a guess as to when, or whether it'd be considered stable-kernel backport material or not, when it happens.
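For reference, the recovery attempt under discussion looks something like this (a sketch only; the device node is a placeholder and the commands need root):

```shell
# Hypothetical recovery attempt; /dev/nvme0n1p2 is a placeholder device.
# Try the normal mount first, read-only:
mount -o ro /dev/nvme0n1p2 /mnt

# If that fails with transid errors, ask btrfs to try the older root
# blocks it normally keeps around:
mount -o ro,usebackuproot /dev/nvme0n1p2 /mnt

# As discussed above, if the filesystem ran with -o discard on a device
# that discards eagerly, those backup roots may already be gone and the
# second attempt will fail as well.
```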
-- Duncan - List replies preferred. No HTML msgs.
Re: Out of space and incorrect size reported
Shane Walton posted on Thu, 22 Mar 2018 00:56:05 + as excerpted:

>>>> btrfs fi df /mnt2/pool_homes
>>> Data, RAID1: total=240.00GiB, used=239.78GiB
>>> System, RAID1: total=8.00MiB, used=64.00KiB
>>> Metadata, RAID1: total=8.00GiB, used=5.90GiB
>>> GlobalReserve, single: total=512.00MiB, used=59.31MiB
>>>
>>>> btrfs filesystem show /mnt2/pool_homes
>>> Label: 'pool_homes'  uuid: 0987930f-8c9c-49cc-985e-de6383863070
>>>     Total devices 2 FS bytes used 245.75GiB
>>>     devid    1 size 465.76GiB used 248.01GiB path /dev/sda
>>>     devid    2 size 465.76GiB used 248.01GiB path /dev/sdb
>>>
>>> Why is the line above "Data, RAID1: total=240.00GiB, used=239.78GiB"
>>> almost full and limited to 240 GiB when I have 2x 500 GB HDDs? What
>>> can I do to make this larger or closer to the full size of 465 GiB
>>> (minus the System and Metadata overhead)?

By my read, Hugo answered correctly, but (I think) not the question you asked. The upgrade was certainly a good idea, 4.4 being quite old now and not really supported well here, as this is a development list and we tend to be focused on the new, not long-ago history. But it didn't change the report output as you expected, because based on your question you're misreading it, and it doesn't say what you are interpreting it as saying.

BTW, you might like the output from btrfs filesystem usage a bit better, as it's somewhat clearer than the previously required btrfs fi df and btrfs fi show (usage is a relatively new subcommand, and thus possibly not in the old 4.4 you posted with above and upgraded from), but understanding how btrfs works and what the reported numbers mean is still useful.

Btrfs does two-stage allocation.
First, it allocates chunks of a specific type, normally data or metadata (system is special, normally only one chunk so no more allocated, and global reserve is actually reserved from metadata and counts as part of it) from unused/unallocated space (which isn't shown by show/df, but usage shows it separately), then when necessary, btrfs actually uses space from the chunks it allocated previously. So what the above df line is saying is that 240 GiB of space have been allocated as data chunks, and 239.78 GiB of that, almost all of it, is used. But you should still have 200+ GiB of unallocated space on each of the devices, as here shown by the individual device lines of the show command (465 total, 248 used), tho as I said, btrfs filesystem usage makes that rather clearer. And btrfs should normally allocate additional space from that 200+ gigs unallocated, to data or metadata chunks, as necessary. Further, because btrfs can't directly take chunks allocated as data and reallocate them as metadata, you *WANT* lots of unallocated space. You do NOT want all that extra space allocated as data chunks, because then they wouldn't be available to allocate as metadata if needed. Now with 200+ GiB of space on each of the two devices unallocated, you shouldn't yet be running into ENOSPC (error no space) errors. If you are, that's a bug, and there have actually been a couple bugs like that recently, but that doesn't mean you want btrfs to unnecessarily allocate all that unallocated space as data space, which would be what it did if it reported all that as data. Rather, you need btrfs to allocate data, and metadata, chunks as needed, and any space related errors you are seeing would be bugs related to that. Now that you have a newer btrfs-progs and kernel, and have read my attempt at an explanation above, try btrfs filesystem usage and see if things are clearer. If not, maybe Hugo or someone else can do better now, answering /that/ question. 
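Putting numbers on that: a quick back-of-envelope check (figures copied from the show output quoted above) of how much of each device remains unallocated:

```shell
# Per-device figures from the fi show report above:
# size 465.76 GiB, chunk-allocated ("used" in fi show) 248.01 GiB.
size=465.76
used=248.01
unallocated=$(awk -v s="$size" -v u="$used" 'BEGIN { printf "%.2f", s - u }')
echo "per-device unallocated: ${unallocated} GiB"
```

That 217.75 GiB per device is the pool btrfs draws new data and metadata chunks from, which is why you want it kept large rather than pre-allocated to data.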
And of course if you're still getting ENOSPC errors with the newer 4.12 kernel, please report that too. Be aware, tho, that 4.14 is the latest LTS series, with 4.9 the LTS before that, and as a normal non-LTS series kernel, 4.12's support has ended as well, so for best support you might wish to either upgrade to a current 4.14 LTS or downgrade to the older 4.9 LTS. Or of course you could go with a current non-LTS.

Normally the latest two release series on both the normal and LTS tracks are best supported: with 4.15 out and 4.16 nearing release, that's the latest 4.15 stable release or 4.14 now (becoming 4.16 and 4.15 once 4.16 is released), or on the LTS track the previously mentioned 4.14 and 4.9 series, tho at a year-plus old, 4.9 is already getting rather harder to support, and 4.14 is old enough now that it's preferred for the LTS track.

-- Duncan - List replies preferred. No HTML msgs.
Re: grub_probe/grub-mkimage does not find all drives in BTRFS RAID1
ut bad upgrades or fat-fingering my /boot, that I kept it! But in addition to two-way raid1 redundancy on multiple devices, btrfs has the dup mode, two-way dup redundancy on a single device, so that's what I do with my /boot and its backups on other devices now, instead of making them raid1s across multiple devices. So while most of my filesystems and their backups are btrfs raid1 both data and metadata across two physical devices (with another pair of physical devices for the btrfs raid1 backups), /boot and its backups are all btrfs dup mixed-bg-mode (so data and metadata mixed, easier to work with on small filesystems), giving me one primary /boot and three backups, and I can still select which one to boot from the hardware/BIOS (legacy not EFI mode, tho I do use GPT and have EFI-boot partitions reserved in case I decide to switch to EFI at some point). So my suggestion would be to do something similar, multiple /boot, one per device, one as the working copy and the other(s) as backups, instead of btrfs raid1 across multiple devices. If you still want to take advantage of btrfs' ability to error-correct from a second copy if the first fails checksum, as I do, btrfs dup mode is useful, but regardless, you'll then have a backup in case the working /boot entirely fails. Tho of course with dup mode you can only use a bit under half the capacity. Your btrfs fi show says 342 MB used (as data) of the 1 GiB, so dup mode should be possible as you'd have a bit under 500 MiB capacity then. Your individual devices say nearly 700 MiB each used, but with only 342 MiB of that as data, the rest is likely partially used chunks that a filtered balance can take care of. A btrfs fi usage report would tell the details (or btrfs fi df, combined with the show you've already posted). 
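Concretely, a dup mixed-bg /boot like the ones described above would be created along these lines (a sketch; /dev/sda2 is a stand-in for the target partition, and mkfs will of course wipe it):

```shell
# Create a small single-device /boot with dup (two-copy) redundancy.
# --mixed puts data and metadata in shared chunks, recommended for tiny
# filesystems; with --mixed, the data and metadata profiles must match.
mkfs.btrfs --mixed -m dup -d dup -L boot /dev/sda2
```

With dup, each block is stored twice on the one device, so usable capacity is a bit under half the partition size, as noted above.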
At a GiB, creating the filesystem as mixed-mode is also recommended, tho that does make a filtered balance a bit more of a hassle, since you have to use the same filters for both data and metadata because they're the same chunks.

FWIW, I started out with 256 MiB /boot here, btrfs dup mode so ~100 MiB usable, but after ssd upgrades and redoing the layout, now use 512 MiB /boots, for 200+ MiB usable. That's better. Your 1 GiB doubles that, so should be no trouble at all, even with dup, unless you're storing way more in /boot than I do.

(Being gentoo I do configure and run a rather slimmer custom initramfs and monolithic kernel configured for only the hardware and dracut initr* modules I need, and a fatter generic initr* and kernel modules would likely need more space, but your show output says it's only using 342 MiB for data, so as I said your 1 GiB for ~500 MiB usable in dup mode should be quite reasonable.)

-- Duncan - List replies preferred. No HTML msgs.
Re: Raid1 volume stuck as read-only: How to dump, recreate and restore its content?
Piotr Pawłow posted on Tue, 13 Mar 2018 08:08:27 +0100 as excerpted:

> Hello,
>> Put differently, 4.7 is missing a year and a half worth of bugfixes
>> that you won't have when you run it to try to check or recover that
>> btrfs that won't mount! Do you *really* want to risk your data on
>> bugs that were after all discovered and fixed over a year ago?
>
> It is also missing newly introduced bugs. Right now I'm dealing with
> btrfs raid1 server that had the fs getting stuck and kernel oopses due
> to a regression:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=198861
>
> I had to cherry-pick commit 3be8828fc507cdafe7040a3dcf361a2bcd8e305b
> and recompile the kernel to even start moving the data off the failing
> drive, as the fix is not in stable yet, and encountering any i/o error
> would break the kernel. And now it seems the fs is corrupted, maybe
> due to all the crashes earlier.
>
> FYI in case you decide to switch to 4.15

In context I was referring to userspace, as the 4.7 was userspace btrfs-progs, not kernelspace.

For kernelspace he was on 4.9, which is the second-newest LTS (long-term-stable) kernel series, and thus should continue to be at least somewhat supported on this list for another year or so, as we try to support the two newest kernels from both the current and LTS series.

Tho 4.9 does lack the newer raid1 per-chunk degraded-writable scanning feature, and AFAIK that won't be stable-backported, as it's more a feature than a bugfix and as such doesn't meet the requirements for stable-series backports. Which is why Adam recommended a newer kernel, since that was the particular problem needing addressed here.

But for someone on an older kernel, presumably because they like stability, I'd suggest the newer 4.14 LTS series kernel as an upgrade, not the only short-term-supported 4.15 series... unless the intent is to continue staying current after that, with 4.16, 4.17, etc.
Which your point about newer kernels coming with newer bugs in addition to fixes supports as well. Moving to the 4.14 LTS should get the real fixes and the longer stabilization time, tho not the feature adds, which would bring a higher chance of more bugs, as well. And with 4.15 out for a while now and 4.16 close, 4.14 should be reasonably stabilizing by now and should be pretty safe to move to.

But of course there's some risk of new bugs in addition to fixes for newer userspace versions too. But since it's kernelspace that's the operational code and userspace is primarily recovery, and we know that older bugs ARE fixed in newer userspace, and assuming a sane backups policy, which I stressed in the same post (if you don't have a backup, you're defining the data as of less value than the time/trouble/resources to create the backup, thus defining it as of relatively low/trivial value in the first place, because you're more willing to risk losing it than you are to spend the time/resources/hassle to ensure against that risk), the better chance at an updated userspace being able to fix problems with less risk of further damage really does justify considering updating to reasonably current userspace.

If there's any doubt, stay a version or two behind the latest release and watch for reports of problems with the latest, but certainly, with 4.15 userspace out and no serious reports of new damage from 4.14 userspace, the latter should now be a reasonably safe upgrade.

-- Duncan - List replies preferred. No HTML msgs.
Re: Raid1 volume stuck as read-only: How to dump, recreate and restore its content?
Backups are fast enough now that, as I predicted, I make them far more often. So I'm walking my own talk, and am able to sleep much more comfortably now, as I'm not worrying about that backup I put off and the chance fate might take me up on my formerly too-high-for-comfort "trivial" threshold definition. =:^)

And as it happens, I'm actually running from a system/root filesystem backup ATM, as an upgrade didn't go well and X wouldn't start, so I reverted. But my root/system filesystem is under 10 gigs, on SSD for the backup as well as the working copy, so a full backup copy of root takes only a few minutes, and I made one before upgrading a few packages I had some doubts about due to previous upgrade issues with them, so the delta between working and that backup was literally the five package upgrades I was, it turned out, rightly worried about. So that investment in ssds for backup has paid off.

While in this particular case simply taking a snapshot and recovering to it when the upgrade went bad would have worked just as well, having the independent filesystem backup on a different set of physical devices means I don't have to worry about loss of the filesystem or physical devices containing it, either! =:^)

-- Duncan - List replies preferred. No HTML msgs.
Re: How to replace a failed drive in btrfs RAID 1 filesystem
Andrei Borzenkov posted on Sat, 10 Mar 2018 13:27:03 +0300 as excerpted:

> And "missing" is not the answer because I obviously may have more than
> one missing device.

"missing" is indeed the answer when using btrfs device remove. See the btrfs-device manpage, which explains that if there's more than one device missing, either just the first one described by the metadata will be removed (if missing is only specified once), or missing can be specified multiple times.

raid6 with two devices missing is the only normal candidate for that presently, tho on-list we've seen aborted-add cases where it still worked as well, because while the metadata listed the new device, it didn't actually have any data when it became apparent it was bad and thus needed to be removed again.

Note that because btrfs raid1 and raid10 only do two-way mirroring regardless of the number of devices, and because of the per-chunk (as opposed to per-device) nature of btrfs raid10, those modes can only expect successful recovery with a single missing device, altho as mentioned above we've seen on-list at least one case where an aborted device-add of a device found to be bad after the add didn't actually have anything on it, so it could still be removed along with the device it was originally intended to replace.

Of course the N-way-mirroring mode, whenever it eventually gets implemented, will allow up to N-1 missing devices, and N-way-parity mode, if it's ever implemented, similar. But N-way-mirroring was scheduled for after raid56 mode so it could make use of some of the same code, and that has of course taken years on years to get merged and stabilize, and there's no sign yet of N-way-mirroring patches, which based on the raid56 case could take years to stabilize and debug after original merge, so the still somewhat iffy raid6 mode is likely to remain the only normal usage of multiple missing for years yet.
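In command form, the manpage behavior described above looks roughly like this (a sketch; /mnt is a placeholder mountpoint):

```shell
# Remove the first missing device recorded in the filesystem metadata:
btrfs device remove missing /mnt

# Per the btrfs-device manpage discussion above, "missing" may be given
# more than once, e.g. for a raid6 with two devices gone:
btrfs device remove missing missing /mnt
```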
For btrfs replace, the manpage says the ID is the only way to handle missing, but getting that ID, as you've indicated, could be difficult. For filesystems with only a few devices that haven't had any or many device config changes, it should be pretty easy to guess (a two-device filesystem with no changes should have IDs 1 and 2, so if only one is listed, the other is obvious, and a 3-4 device fs with only one or two previous device changes, likely well remembered by the admin, should still be reasonably easy to guess), but as the number of devices and the number of device adds/removes/replaces increases, finding/guessing the missing one becomes far more difficult.

Of course the sysadmin's first rule of backups states in simple form that not having one == defining the value of the data as trivial, not worth the trouble of a backup, which in turn means that at some point before there's /too/ many device change events, it's likely going to be less trouble (particularly after factoring in reliability) to restore from backups to a fresh filesystem than it is to do yet another device change. Together with the current practical limits btrfs imposes on the number of missing devices, that tends to impose /some/ limit on the possibilities for missing device IDs, so the situation, while not ideal, isn't yet /entirely/ out of hand, either, because a successful guess based on available information should be possible without /too/ many attempts.

-- Duncan - List replies preferred. No HTML msgs.
Re: spurious full btrfs corruption
Christoph Anton Mitterer posted on Tue, 06 Mar 2018 01:57:58 +0100 as excerpted:

> In the meantime I had a look of the remaining files that I got from the
> btrfs-restore (haven't run it again so far, from the OLD notebook, so
> only the results from the NEW notebook here:):
>
> The remaining ones were multi-GB qcow2 images for some qemu VMs.
> I think I had none of these files open (i.e. VMs running) while in the
> final corruption phase... but at least I'm sure that not *all* of them
> were running.
>
> However, all the qcow2 files from the restore are more or less garbage.
> During the btrfs-restore it already complained on them, that it would
> loop too often on them and whether I want to continue or not (I chose n
> and on another full run I chose y).
>
> Some still contain a partition table, some partitions even filesystems
> (btrfs again)... but I cannot mount them.

Just a note on format choices FWIW, nothing at all to do with your current problem...

As my own use-case doesn't involve VMs I'm /far/ from an expert here, and if I'm screwing things up I'm sure someone will correct me and I'll learn something too, but it does /sound/ reasonable, so assuming I'm remembering correctly from a discussion here...

Tip: Btrfs and qcow2 are both copy-on-write/COW (it's in the qcow2 name, after all), and doing multiple layers of COW is both inefficient and a good candidate to test for corner-case bugs that wouldn't show up in more normal use-cases. Assuming bug-free it /should/ work properly, of course, but equally of course, bug-free isn't an entirely realistic assumption. =8^0

... And you're putting btrfs on qcow2 on btrfs... THREE layers of COW!

The recommendation was thus to pick the layer you wish to COW at, and use something that's not COW-based at the other layers. Apparently qemu has raw-format as a choice as well as qcow2, and that was recommended as preferred for use with btrfs (and IIRC what the recommender was using himself).
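In qemu terms, that recommendation might look like the following sketch (paths and size are made up; the chattr step applies only if image files are kept on btrfs, and must be done before the files are created):

```shell
# Option 1: keep COW at the btrfs layer only, by using a raw rather
# than qcow2 image:
qemu-img create -f raw /var/lib/vms/guest.img 20G

# Option 2: keep qcow2 but drop COW at the btrfs layer instead, by
# setting the nocow attribute on the (empty) images directory so newly
# created files inherit it:
mkdir -p /var/lib/vms
chattr +C /var/lib/vms
```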
But of course that still leaves cow-based btrfs on both the top and the bottom layers. I suppose which of those is best to remain btrfs, while making the other say ext4, as the widest used and hopefully safest general-purpose non-COW alternative, depends on the use-case.

Of course keeping btrfs at both levels but nocowing the image files on the host btrfs is a possibility as well, but nocow on btrfs has enough limits and caveats that I consider it a second-class "really should have used a different filesystem for this but didn't want to bother setting up a dedicated one" choice, and as such, don't consider it a viable option here.

-- Duncan - List replies preferred. No HTML msgs.
Re: btrfs space used issue
vinayak hegde posted on Thu, 01 Mar 2018 14:56:46 +0530 as excerpted:

> This will happen over and over again until we have completely
> overwritten the original extent, at which point your space usage will
> go back down to ~302g. We split big extents with cow, so unless you've
> got lots of space to spare or are going to use nodatacow you should
> probably not pre-allocate virt images.

Indeed. Preallocation with COW doesn't make the sense it does on an overwrite-in-place filesystem. Either nocow it and take the penalties that brings[1], or configure your app not to preallocate in the first place[2].

---
[1] On btrfs, nocow implies no checksumming or transparent compression, either. Also, the nocow attribute needs to be set on the empty file, with the easiest way to do that being to set it on the parent directory before file creation, so it's inherited by any newly created files/subdirs within it.

[2] Many apps that preallocate by default have an option to turn preallocation off.

-- Duncan - List replies preferred. No HTML msgs.
Re: btrfs space used issue
Austin S. Hemmelgarn posted on Wed, 28 Feb 2018 14:24:40 -0500 as excerpted: >> I believe this effect is what Austin was referencing when he suggested >> the defrag, tho defrag won't necessarily /entirely/ clear it up. One >> way to be /sure/ it's cleared up would be to rewrite the entire file, >> deleting the original, either by copying it to a different filesystem >> and back (with the off-filesystem copy guaranteeing that it can't use >> reflinks to the existing extents), or by using cp's --reflink=never >> option. >> (FWIW, I prefer the former, just to be sure, using temporary copies to >> a suitably sized tmpfs for speed where possible, tho obviously if the >> file is larger than your memory size that's not possible.) > Correct, this is why I recommended trying a defrag. I've actually never > seen things so bad that a simple defrag didn't fix them however (though > I have seen a few cases where the target extent size had to be set > higher than the default of 20MB). Good to know. I knew larger target extent sizes could help, but between not being sure they'd entirely fix it and not wanting to get too far down into the detail when the copy-off-the-filesystem-and-back option is /sure/ to fix the problem, I decided to handwave that part of it. =:^) > Also, as counter-intuitive as it > might sound, autodefrag really doesn't help much with this, and can > actually make things worse. I hadn't actually seen that here, but suspect I might, now, as previous autodefrag behavior on my system tended to rewrite the entire file[1], thereby effectively giving me the benefit of the copy-away-and-back technique without actually bothering, while that "bug" has now been fixed. I sort of wish the old behavior remained an option, maybe radicalautodefrag or something, and must confess to being a bit concerned over the eventual impact here now that autodefrag does /not/ rewrite the entire file any more, but oh, well... 
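For anyone wanting to try Austin's larger-target-extent suggestion, the knob is defrag's -t option; a sketch only, with the file path made up:

```shell
# Defragment one file, raising the target extent size above the default
# Austin mentions (20MB) so defrag merges into larger extents:
btrfs filesystem defragment -t 64M /path/to/big-rewrite-heavy.file
```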
Chances are it's not going to be /that/ big a deal, since I /am/ on fast ssd, and if it becomes one, I guess I can just set up say firefox-profile-defrag.timer jobs or whatever, as necessary.

---
[1] I forgot whether it was ssd behavior, or compression, or what, but something I'm using here apparently forced autodefrag to rewrite the entire file, and a recent "bugfix" changed that so it's more in line with the normal autodefrag behavior. I rather preferred the old behavior, especially since I'm on fast ssd and all my large files tend to be write-once no-rewrite anyway, but I understand the performance implications on large active-rewrite files such as gig-plus database and VM-image files, so...

-- Duncan - List replies preferred. No HTML msgs.
Re: btrfs space used issue
ing extents), or by using cp's --reflink=never option. (FWIW, I prefer the former, just to be sure, using temporary copies to a suitably sized tmpfs for speed where possible, tho obviously if the file is larger than your memory size that's not possible.)

Of course where applicable, snapshots and dedup keep reflink-references to the old extents, so they must be adjusted or deleted as well, to properly free that space.

---
[1] du: Because its purpose is different. du's primary purpose is telling you in detail what space files take up, per-file and per-directory, without particular regard to usage on the filesystem itself. df's focus, by contrast, is on the filesystem as a whole. So where two files share the same extent due to reflinking, du should and does count that usage for each file, because that's what each file /uses/, even if they both use the same extents.

-- Duncan - List replies preferred. No HTML msgs.
Re: Ongoing Btrfs stability issues
Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as excerpted: > This will probably sound like an odd question, but does BTRFS think your > storage devices are SSD's or not? Based on what you're saying, it > sounds like you're running into issues resulting from the > over-aggressive SSD 'optimizations' that were done by BTRFS until very > recently. > > You can verify if this is what's causing your problems or not by either > upgrading to a recent mainline kernel version (I know the changes are in > 4.15, I don't remember for certain if they're in 4.14 or not, but I > think they are), or by adding 'nossd' to your mount options, and then > seeing if you still have the problems or not (I suspect this is only > part of it, and thus changing this will reduce the issues, but not > completely eliminate them). Make sure and run a full balance after > changing either item, as the aforementioned 'optimizations' have an > impact on how data is organized on-disk (which is ultimately what causes > the issues), so they will have a lingering effect if you don't balance > everything. According to the wiki, 4.14 does indeed have the ssd changes. According to the bug, he's running 4.13.x on one server and 4.14.x on two. So upgrading the one to 4.14.x should mean all will have that fix. However, without a full balance it /will/ take some time to settle down (again, assuming btrfs was using ssd mode), so the lingering effect could still be creating problems on the 4.14 kernel servers for the moment.
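A quick, hedged way to see which side of the ssd heuristic a filesystem landed on: the kernel's per-device rotational flag (0 means it was detected as an SSD), and the effective mount options btrfs actually ended up with. Device names and mountpoints vary per system:

```shell
# 0 = non-rotational (SSD heuristics apply), 1 = rotational spinning rust.
cat /sys/block/*/queue/rotational 2>/dev/null

# Effective options of mounted btrfs filesystems; look for "ssd" or "nossd".
awk '$3 == "btrfs" { print $2 ": " $4 }' /proc/mounts
```

If "ssd" shows up where it shouldn't (or vice versa), the ssd/nossd mount options override the autodetection.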
Re: fatal database corruption with btrfs "out of space" with ~50 GB left
Tomasz Chmielewski posted on Thu, 15 Feb 2018 16:02:59 +0900 as excerpted: >> Not sure if the removal of 80G has anything to do with this, but this >> seems that your metadata (along with data) is quite scattered. >> >> It's really recommended to keep some unallocated device space, and one >> of the method to do that is to use balance to free such scattered space >> from data/metadata usage. >> >> And that's why balance routine is recommened for btrfs. > > The balance might work on that server - it's less than 0.5 TB SSD disks. > > However, on multi-terabyte servers with terabytes of data on HDD disks, > running balance is not realistic. We have some servers where balance was > taking 2 months or so, and was not even 50% done. And the IO load the > balance was adding was slowing the things down a lot. Try a filtered balance. Something along the lines of: btrfs balance start -dusage=10 <mountpoint> The -dusage number, a limit on the chunk usage percentage, can start small, even 0, and be increased as necessary, until btrfs fi usage reports data size (currently 411 GiB) closer to data usage (currently 246.14 GiB), with the freed space returning to unallocated. I'd shoot for reducing data size to under 300 GiB, thus returning over 100 GiB to unallocated, while hopefully not requiring too high a -dusage percentage and thus too long a balance time. You could get it down under 250 gig size, but that would likely take a lot of rewriting for little additional gain, since with it under 300 gig size you should already have over 100 gig unallocated. Balance time should be quite short for low percentages, with a big payback if there's quite a few chunks with little usage, because at 10%, the filesystem can get rid of 10 chunks while only rewriting the equivalent of a single full chunk. Obviously as the chunk usage percentage goes up, the payback goes down, so at 50%, it can only clear two chunks while writing one, and at 66%, it has to write two chunks worth to clear three.
Above that (tho I tend to round up to 70% here) is seldom worth it until the filesystem gets quite full and you're really fighting to keep a few gigs of unallocated space. (As Qu indicated, you always want at least a gig of unallocated space, on at least two devices if you're doing raid1.) If you really wanted you could do the same with -musage for metadata, except that's not so bad, only 9 gig size, 3 gig used. But you could free 5 gigs or so, if desired. That's assuming there's no problem. I see a followup indicating you're seeing problems in dmesg with a balance, however, and will let others deal with that.
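The incremental approach described above might look like this in practice -- a sketch only, with a hypothetical mountpoint, to be run as root on the affected filesystem:

```shell
# Raise the -dusage cutoff stepwise, checking allocation after each pass;
# stop raising it once unallocated space is back over the target (~100 GiB
# in the example above). /mnt/data is a placeholder for the real mountpoint.
MNT=/mnt/data
for pct in 0 5 10 20 30 50; do
    btrfs balance start -dusage="$pct" "$MNT"
    btrfs filesystem usage "$MNT" | grep -i unallocated
done
```

The low-percentage passes finish quickly because they only rewrite nearly-empty chunks, which is exactly where the payback per chunk rewritten is highest.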
Re: Status of FST and mount times
Qu Wenruo posted on Thu, 15 Feb 2018 09:42:27 +0800 as excerpted: > The easiest way to get a basic idea of how large your extent tree is > using debug tree: > > # btrfs-debug-tree -r -t extent > > You would get something like: > btrfs-progs v4.15 extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776 > level 0 <<< > total bytes 10737418240 bytes used 393216 uuid > 651fcf0c-0ffd-4351-9721-84b1615f02e0 > > That level is would give you some basic idea of the size of your extent > tree. > > For level 0, it could contains about 400 items for average. > For level 1, it could contains up to 197K items. > ... > For leven n, it could contains up to 400 * 493 ^ (n - 1) items. > ( n <= 7 ) So for level 2 (which I see on a couple of mine here, ran it out of curiosity): 400 * 493 ^ (2 - 1) = 400 * 493 = 197200 197K for both level 1 and level 2? Doesn't look correct. Perhaps you meant a simple power of n, instead of (n-1)? That would yield ~97M for level 2, and would yield the given numbers for levels 0 and 1 as well, whereby using n-1 for level 0 yields less than a single entry, and 400 for level 1. Or the given numbers were for level 1 and 2, with level 0 not holding anything, not levels 0 and 1. But that wouldn't jive with your level 0 example, which I would assume could never happen if it couldn't hold even a single entry.
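The arithmetic is easy to check in the shell. This sketch evaluates the corrected formula (exponent n rather than n-1, per the reasoning above), which reproduces both quoted figures: ~400 items at level 0 and ~197K at level 1:

```shell
# With 400 * 493^n, each level multiplies capacity by ~493 (keys per
# internal node), starting from ~400 items in a leaf.
cap=400
for n in 0 1 2; do
    echo "level $n: up to $cap items"
    cap=$((cap * 493))
done
```

That gives 400 at level 0, 197200 at level 1, and 97219600 (~97M) at level 2, matching the estimate in the post.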
Re: fatal database corruption with btrfs "out of space" with ~50 GB left
Tomasz Chmielewski posted on Wed, 14 Feb 2018 23:19:20 +0900 as excerpted: > Just FYI, how dangerous running btrfs can be - we had a fatal, > unrecoverable MySQL corruption when btrfs decided to do one of these "I > have ~50 GB left, so let's do out of space (and corrupt some files at > the same time, ha ha!)". Ouch! > Running btrfs RAID-1 with kernel 4.14. Kernel 4.14... quite current... good. But 4.14.0 first release, 4.14.x current stable, or somewhere (where?) in between? And please post the output of btrfs fi usage for that filesystem. Without that (or fi sh and fi df, the pre-usage method of getting nearly the same info), it's hard to say where or what the problem was. Meanwhile, FWIW there was a recent metadata over-reserve bug that should be fixed in 4.15 and the latest 4.14 stable, but IDR whether it affected 4.14.0 original or only the 4.13 series and early 4.14-rcs and was fixed by 4.14.0. The bug seemed to trigger most frequently when doing balances or other major writes to the filesystem, on middle to large sized filesystems. (My btrfs filesystems, all under a quarter TB each, didn't appear to be affected.)
Re: Status of FST and mount times
d if you can avoid it because the btrfs check --repair fix is trivial, it's worth doing so. Valid case, but there's nothing in your post indicating it's valid as /your/ case. Of course the other possibility is live-failover, which is sure to be facebook's use-case. But with live-failover, the viability of btrfs check --repair more or less ceases to be of interest, because the failover happens (relative to the offline check or restore time) instantly, and once the failed devices/machine is taken out of service it's far more effective to simply blow away the filesystem (if not replacing the device(s) entirely) and restore "at leisure" from backup, a relatively guaranteed procedure compared to the "no guarantees" of attempting to check --repair the filesystem out of trouble. Which is very likely why the free-space-tree still isn't well supported by btrfs-progs, including btrfs check, several kernel (and thus -progs) development cycles later. The people who really need the one (whichever one of the two)... don't tend to (or at least /shouldn't/) make use of the other so much. It's also worth mentioning that btrfs raid0 mode, as well as single mode, hobbles the btrfs data and metadata integrity feature, because while checksums can and are still generated, stored and checked by default, and integrity problems can still be detected, because raid0 (and single) includes no redundancy, there's no second copy (raid1/10) or parity redundancy (raid5/6) to rebuild the bad data from, so it's simply gone. (Well, for data you can try btrfs restore of the otherwise inaccessible file and hope for the best, and for metadata, you can try check --repair and again hope for the best, but...) If you're using that feature of btrfs and want/need more than just detection of a problem that can't be fixed due to lack of redundancy, there's a good chance you want a real redundancy raid mode on multi-device, or dup mode on single device. So bottom line... 
given the sacrificial lack of redundancy and reliability of raid0, btrfs or not, in an enterprise setting with tens of TB of data, why are you worrying about the viability of btrfs check -- repair on what the placement on raid0 decrees to be throw-away data anyway? At first glance anyway, one or the other, either the raid0 mode and thus declared throw-away value of tens of TB of data, or the viability of btrfs check --repair, indicating you don't consider the data you just declared to be of throw-away value by placing it on raid0, to be of throw-away value after all, must be wrong. Which one is wrong is your call, and there's certainly individual cases (one of which I even named) where concern about the viability of btrfs check --repair on raid0 might be valid, but your post has no real indication that your case is such a case, and honestly, that worries me! > 2. There's another thread on-going about mount delays. I've been > completely blind to this specific problem until it caught my eye. Does > anyone have ballpark estimates for how long very large HDD-based > filesystems will take to mount? Yes, I know it will depend on the > dataset. I'm looking for O() worst-case approximations for > enterprise-grade large drives (12/14TB), as I expect it should scale > with multiple drives so approximating for a single drive should be good > enough. No input on that question here (my own use-case couldn't be more different, multiple small sub-half-TB independent btrfs raid1s on partitioned ssds), but another concern, based on real-world reports I've seen on-list: 12-14 TB individual drives? While you /did/ say enterprise grade so this probably doesn't apply to you, it might apply to others that will read this. Be careful that you're not trying to use the "archive application" targeted SMR drives for general purpose use. Occasionally people will try to buy and use such drives in general purpose use due to their cheaper per-TB cost, and it just doesn't go well. 
We've had a number of reports of that. =:^(
Re: btrfs - kernel warning
Duncan posted on Fri, 02 Feb 2018 02:49:52 +0000 as excerpted: > As CMurphy says, 4.11-ish is starting to be reasonable. But you're on > the LTS kernel 4.14 series and userspace 4.14 was developed in parallel, > so btrfs-progs-3.14 would be ideal. Umm... obviously that should be 4.14.
Re: btrfs - kernel warning
ast backup and the current state. As soon as the change to your data since the last backup becomes more valuable than the time/trouble/resources necessary to update your backup, you will do so. If you haven't, it simply means you're defining the changes since your last backup as of less value than the time/trouble/resources necessary to do that update, so again, you can *always* rest easy in the face of filesystem or device problems, because you either have it backed up, or by definition of /not/ having it backed up, it was self-evidently not worth the trouble to do so yet, so you saved what was most important to you either way. So think about your value definitions regarding your data and change them if you need to... while you still have the chance. =:^) (And the implications of the above change how you deal with a broken filesystem too. With either current backups or what you've literally defined as throw-away data due to it not being worth the trouble of backups, it makes little sense to spend more than a trivial amount of time trying to recover data from a messed up filesystem, especially given that there's no guarantee you'll get it all back undamaged even if you /do/ spend the time. It's often simpler, takes less time, and is more certain of success, to simply blow away the defective filesystem with a fresh mkfs and restore the data from backups, since that way you know you'll have a fresh filesystem and known-good data from the backup, as opposed to no guarantees /what/ you'll end up with trying to recover/repair the old filesystem.)
Re: degraded permanent mount option
Andrei Borzenkov posted on Sun, 28 Jan 2018 11:06:06 +0300 as excerpted: > 27.01.2018 18:22, Duncan wrote: >> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted: >> >>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote: >>>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote: >>>> >>>>>> I just tested to boot with a single drive (raid1 degraded), even >>>>>> with degraded option in fstab and grub, unable to boot ! The boot >>>>>> process stop on initramfs. >>>>>> >>>>>> Is there a solution to boot with systemd and degraded array ? >>>>> >>>>> No. It is finger pointing. Both btrfs and systemd developers say >>>>> everything is fine from their point of view. >>> >>> It's quite obvious who's the culprit: every single remaining rc system >>> manages to mount degraded btrfs without problems. They just don't try >>> to outsmart the kernel. >> >> No kidding. >> >> All systemd has to do is leave the mount alone that the kernel has >> already done, > > Are you sure you really understand the problem? No mount happens because > systemd waits for indication that it can mount and it never gets this > indication. As Tomasz indicates, I'm talking about manual mounting (after the initr* drops to a maintenance prompt if it's root being mounted, or on manual mount later if it's an optional mount) here. The kernel accepts the degraded mount and it's mounted for a fraction of a second, but systemd actually undoes the successful work of the kernel to mount it, so by the time the prompt returns and a user can check, the filesystem is unmounted again, with the only indication that it was mounted at all being the log. He says that's because the kernel still says it's not ready, but that's for /normal/ mounting. The kernel accepted the degraded mount and actually mounted the filesystem, but systemd undoes that.
Re: degraded permanent mount option
Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted: > On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote: >> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote: >> >> >> I just tested to boot with a single drive (raid1 degraded), even >> >> with degraded option in fstab and grub, unable to boot ! The boot >> >> process stop on initramfs. >> >> >> >> Is there a solution to boot with systemd and degraded array ? >> > >> > No. It is finger pointing. Both btrfs and systemd developers say >> > everything is fine from their point of view. > > It's quite obvious who's the culprit: every single remaining rc system > manages to mount degraded btrfs without problems. They just don't try > to outsmart the kernel. No kidding. All systemd has to do is leave the mount alone that the kernel has already done, instead of insisting it knows what's going on better than the kernel does, and immediately umounting it.
Re: bad key ordering - repairable?
tube or whatever on fullscreen, and to now my second generation of ssds, a pair of 1 TB samsung evos, but this reminds me that at nearing six years old the main system's aging too, so I better start thinking of replacing it again...)
Re: Periodic frame losses when recording to btrfs volume with OBS
ein posted on Tue, 23 Jan 2018 09:38:13 +0100 as excerpted: > On 01/22/2018 09:59 AM, Duncan wrote: >> >> And to tie up a loose end, xfs has somewhat different design principles >> and may well not be particularly sensitive to the dirty_* settings, >> while btrfs, due to COW and other design choices, is likely more >> sensitive to them than the widely used ext* and reiserfs (my old choice >> and the basis of my own settings, above). > Excellent booklike writeup showing how /proc/sys/vm/ works, but I > wonder, how can you explain why does XFS work in this case? I can't, directly, which is why I glossed over it so fast above. I do have some "educated guesswork", but that's _all_ it is, as I've not had reason to get particularly familiar with xfs and its quirks. You'd have to ask the xfs folks if my _guess_ is anything approaching reality, but if you do please be clear that I explicitly said I don't know and that this is simply my best guess based on the very limited exposure to xfs discussions I've had. So I'm not experience-familiar with xfs and other than what I've happened across in cross-list threads here, know little about it except that it was ported to Linux from other *ix. I understand the xfs port to "native" is far more complete than that of zfs, for example. Additionally, I know from various vfs discussion threads cross-posted to this and other filesystem lists that xfs remains rather different than some -- apparently (if I've gotten it right) it handles "objects" rather than inodes and extents, for instance. Apparently, if the vfs threads I've read are to be believed, xfs would have some trouble with a proposed vfs interface that would allow requests to write out and free N pages or N KiB of dirty RAM from the write buffers in order to clear memory for other usage, because it tracks objects rather than dirty pages/KiB of RAM. Sure it could do it, but it wouldn't be an efficient enough operation to be worth the trouble for xfs.
So apparently xfs just won't make use of that feature of the proposed new vfs API, there's nothing that says it /has/ to, after all -- it's proposed to be optional, not mandatory. Now that discussion was in a somewhat different context than the vm.dirty_* settings discussion here, but it seems reasonable to assume that if xfs would have trouble converting objects to the size of the memory they take in the one case, the /proc/sys/vm/dirty_* dirty writeback cache tweaking features may not apply to xfs, at least in a direct/intuitive way, either. Which is why I suggested xfs might not be particularly sensitive to those settings -- I don't know that it ignores them entirely, and it may use them in /some/ way, possibly indirectly, but the evidence I've seen does suggest that xfs may, if it uses those settings at all, not be as sensitive to them as btrfs/reiserfs/ext*. Meanwhile, due to the extra work btrfs does with checksumming and cow, while AFAIK it uses the settings "straight", having them out of whack likely has a stronger effect on btrfs than it does on ext* and reiserfs (with reiserfs likely being slightly more strongly affected than ext*, but not to the level of btrfs). And there has indeed been confirmation on-list that adjusting these settings *does* have a very favorable effect on btrfs for /some/ use-cases. (In one particular case, the posting was to the main LKML, but on btrfs IIRC, and Linus got involved. I don't believe that led to the /creation/ of the relatively new per-device throttling stuff as I believe the patches were already around, but I suspect it may have led to their integration in mainline a few kernel cycles earlier than they may have been otherwise. Because it's a reasonably well known "secret" that the default ratios are out of whack on modern systems, it's just not settled what the new defaults /should/ be, so in the absence of agreement or pressing problem, they remain as they are.
But Linus blew his top as he's known to do; he and others pointed the reporter at the vm.dirty_* settings, tho Linus wanted to know why the defaults were so insane for today's machines, and tweaking those did indeed help. Then a kernel cycle or two later the throttling options appeared in mainline, very possibly as a result of Linus "routing around the problem" to some extent.) So in my head I have a picture of the possible continuum of vm.dirty_* effect that looks like this:

<- weak effect                          strong effect ->
zfs ..... xfs ..... ext* ..... reiserfs ..... btrfs

zfs, no or almost no effect, because it uses a non-native mechanism and is poorly adapted to Linux. xfs, possibly some effect, but likely relatively light, becaus
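For anyone wanting to experiment along these lines, the knobs live under /proc/sys/vm/; reading them needs no privileges, and the *_bytes variants, when set nonzero, override the percentage-based ones. The example values below are purely illustrative assumptions, not a recommendation:

```shell
# Current writeback thresholds (percent of reclaimable memory).
grep -H . /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio

# Illustrative tightened settings for a large-RAM box (root required),
# capping dirty data at fixed byte counts instead of percentages:
#   sysctl -w vm.dirty_background_bytes=$((64 * 1024 * 1024))
#   sysctl -w vm.dirty_bytes=$((256 * 1024 * 1024))
```

Smaller fixed caps mean writeback starts sooner and stalls are shorter, at the cost of less write coalescing; what's sensible depends heavily on workload and device speed.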
Re: Periodic frame losses when recording to btrfs volume with OBS
- the default is the venerable CFQ but deadline may well be better for a streaming use- case, and now there's the new multi-queue stuff and the multi-queue kyber and bfq schedulers, as well -- and setting IO priority -- probably by increasing the IO priority of the streaming app. The tool to use for the latter is called ionice. Do note, however, that not all schedulers implement IO priorities. CFQ does, but while I think deadline should work better for the streaming use-case, it's simpler code and I don't believe it implements IO priority. Similarly for multi-queue, I'd guess the low-code-designed-for-fast-direct-PCIE-connected-SSD kyber doesn't implement IO priorities, while the more complex and general purpose suitable-for-spinning-rust bfq /might/ implement IO priorities. But I know less about that stuff and it's googlable, should you decide to try playing with it too. I know what the dirty_* stuff does from personal experience. =:^) And to tie up a loose end, xfs has somewhat different design principles and may well not be particularly sensitive to the dirty_* settings, while btrfs, due to COW and other design choices, is likely more sensitive to them than the widely used ext* and reiserfs (my old choice and the basis of my own settings, above).
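As a concrete sketch of the ionice suggestion (the best-effort class used here needs no root; the echo stands in for the actual recording application):

```shell
# Run a command at the highest best-effort IO priority (class 2, level 0).
# Note: only schedulers that implement IO priorities (e.g. CFQ/BFQ)
# actually honor this.
ionice -c 2 -n 0 echo "recorder would run here"

# The active scheduler is readable per block device; "sda" is an example:
#   cat /sys/block/sda/queue/scheduler
```

The realtime class (-c 1) prioritizes even more aggressively but requires root and can starve other IO, so best-effort is the usual starting point.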
Re: btrfs volume corrupt. btrfs-progs bug or need to rebuild volume?
Rosen Penev posted on Fri, 19 Jan 2018 13:45:35 -0800 as excerpted: > v2: Add proper subject =:^) > I've been playing around with a specific kernel on a specific device > trying to figure out why btrfs keeps throwing csum errors after ~15 > hours. I've almost nailed it down to some specific CONFIG option in the > kernel, possibly related to IRQs. > > Anyway, I managed to get my btrfs RAID5 array corrupted to the point > where it will just mount to read-only mode. [...] > This is with version 4.14 of btrfs-progs. Do I need a newer version or > should I just reinitialize my array and copy everything back? > > Log on mount attached below: [...] > Fri Jan 19 14:26:08 2018 kern.warn kernel: > [168383.378239] CPU: 0 PID: > 2496 Comm: kworker/u8:2 Tainted: GW 4.9.75 #0 Tho as the penultimate LTS kernel series, 4.9 is still on the btrfs-list supported list in general... 4.9 still had known btrfs raid56 mode issues and is strongly negatively recommended for use with btrfs raid56 mode. Those weren't fixed until 4.12, which /finally/ brought raid56 mode into a generally working and not negatively recommended state. While, as an LTS, applicable general btrfs bug fixes would be backported to 4.9, because raid56 mode had never worked /well/ at that point, I'm not sure those fixes were backported. So you really need either kernel 4.12+, presumably the LTS 4.14 series since you're on the LTS 4.9 series now, for btrfs raid56 mode, or don't use raid56 mode if you plan on staying with the 4.9 LTS, as it still had severe known issues back then and I haven't seen on-list confirmation that the 4.12 btrfs raid56 mode fixes were backported to 4.9-LTS. If you need/choose to stick with 4.9 and dump raid56 mode, the recommended alternative depends on the number of devices in the filesystem. For a small number of devices in the filesystem, btrfs raid1 is effectively as stable as the still stabilizing and maturing btrfs itself is at this point and is recommended.
For a larger number of devices, btrfs raid1 is still a good choice because it /is/ the most mature, but btrfs raid10 is /reasonably/ stable tho IMO not quite as stable as raid1, or for better performance (due to btrfs raid10 not being read-optimized yet) while keeping btrfs checksumming and error repair from the second copy when available, consider a layered approach, with btrfs raid1 on top of a pair of mdraid0s (or dmraid0s, or hardware raid0s).
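A sketch of that layered layout, with entirely hypothetical device names -- these commands are destructive, so don't run them against devices holding data:

```shell
# Two md raid0 pairs for speed, then btrfs raid1 across the pairs for
# redundancy, keeping btrfs checksum-based repair from the second copy.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
```

From btrfs's point of view there are just two devices (the two md arrays), so each raid1 copy lands on a different raid0 pair.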
Re: big volumes only work reliable with ssd_spread
Stefan Priebe - Profihost AG posted on Mon, 15 Jan 2018 10:55:42 +0100 as excerpted: > since around two or three years i'm using btrfs for incremental VM > backups. > > some data: > - volume size 60TB > - around 2000 subvolumes > - each differential backup stacks on top of a subvolume > - compress-force=zstd > - space_cache=v2 > - no quote / qgroup > > this works fine since Kernel 4.14 except that i need ssd_spread as an > option. If i do not use ssd_spread i always end up with very slow > performance and a single kworker process using 100% CPU after some days. > > With ssd_spread those boxes run fine since around 6 month. Is this > something expected? I haven't found any hint regarding such an impact. My understanding of the technical details is "limited" as I'm not a dev, and I expect you'll get a more technically accurate response later, but sometimes a first not particularly technical response can be helpful as long as it's not /wrong/. (And if it is this is a good way to have my understanding corrected as well. =:^) With that caveat, based on my understanding of what I've seen on-list... The kernel v4.14 ssd mount-option changes apparently primarily affected data, not metadata. Apparently, ssd_spread has a heavier metadata effect, and the v4.14 changes moved additional (I believe metadata) functionality to ssd-spread that had originally been part of ssd as well. There has been some discussion of metadata tweaks similar to those in 4.14 for the ssd option with data, but they weren't deemed as demonstrably needed as the ssd option tweaks and needed further discussion, so were put off until the effect of the 4.14 tweaks could be gauged in more widespread use, after which they were to be reconsidered, if necessary. 
Meanwhile, in the discussion I saw, Chris Mason mentioned that Facebook is using ssd_spread for various reasons there, so it's well-tested with their deployments, which I'd assume have many of the same qualities yours do, thus implying that your observations about ssd_spread are no accident. In fact, if I interpreted Chris's comments correctly, they use ssd_spread on very large multi-layered non-ssd storage arrays, in part because the larger layout-alignment optimizations make sense there as well as on ssds. That would appear to be precisely what you are seeing. =:^) If that's the case, then arguably the option is misnamed and the ssd_spread name may well at some point be deprecated in favor of something more descriptive of its actual function and target devices. Purely my own speculation here, but perhaps something like vla_spread (very-large-array)?
Re: Hanging after frequent use of systemd-nspawn --ephemeral
Qu Wenruo posted on Sun, 14 Jan 2018 10:27:40 +0800 as excerpted: > Despite of that, did that really hangs? > Qgroup dramatically increase overhead to delete a subvolume or balance > the fs. > Maybe it's just a little slow? Same question about the "hang" here. Note that btrfs is optimized to make snapshot creation fast, while snapshot deletion has to do more work to clean things up. So even without qgroup enabled, deletion can take a bit of time (much longer than creation, which should be nearly instantaneous in human terms) if there's a lot of reflinks and the like to clean up. And qgroups makes btrfs do much more work to track that as well, so as Qu says, that'll make snapshot deletion take even longer, and you probably want it disabled unless you actually need the feature for something you're doing.
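If the qgroup feature does turn out to be unneeded, it can be checked and switched off with the quota commands -- /mnt below is a placeholder mountpoint, and both commands need root on the btrfs filesystem in question:

```shell
# See whether quotas/qgroups are active at all on this filesystem...
btrfs qgroup show /mnt

# ...and if they are and aren't needed, disable the bookkeeping that
# slows down subvolume/snapshot deletion.
btrfs quota disable /mnt
```

Note that systemd-nspawn --ephemeral itself doesn't need qgroups; they're only required if you actually want per-subvolume space accounting.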