Rich Freeman <[email protected]> writes:

> On Tue, Jan 5, 2016 at 5:16 PM, lee <[email protected]> wrote:
>> Rich Freeman <[email protected]> writes:
>>>
>>> I would run btrfs on bare partitions and use btrfs's raid1 capabilities. You're almost certainly going to get better performance, and you get more data integrity features.

>> That would require me to set up software raid with mdadm as well, for the swap partition.

> Correct, if you don't want a panic if a single swap drive fails.

>>> If you have a silent corruption with mdadm doing the raid1 then btrfs will happily warn you of your problem and you're going to have a really hard time fixing it,

>> BTW, what do you do when you have silent corruption on a swap partition? Is that possible, or does swapping use its own checksums?

> If the kernel pages in data from the good mirror, nothing happens. If the kernel pages in data from the bad mirror, then whatever data happens to be there is what will get loaded and used and/or executed. If you're lucky the modified data will be part of unused heap or something. If not, well, just about anything could happen.
>
> Nothing in this scenario will check that the data is correct, except for a forced scrub of the disks. A scrub would probably detect the error, but I don't think mdadm has any ability to recover it. Your best bet is probably to try to immediately reboot and save what you can, or a less-risky solution assuming you don't have anything critical in RAM is to just do an immediate hard reset so that there is no risk of bad data getting swapped in and overwriting good data on your normal filesystems.
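For what it's worth, forcing such a check on an md mirror is only a sysfs write away. A rough sketch, assuming the swap array is /dev/md0:

  # kick off a consistency check of the mirror
  echo check > /sys/block/md0/md/sync_action

  # when it finishes, a non-zero count means the two copies disagree somewhere
  cat /sys/block/md0/md/mismatch_cnt

  # "echo repair" would rewrite the mismatched blocks, but for raid1 md simply
  # copies one leg over the other; it has no way to tell which copy is the good one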
Then you might be better off with no swap unless you put it on a file system that uses checksums.

>> It's still odd. I already have two different file systems and the overhead of one kind of software raid while I would rather stick to one file system. With btrfs, I'd still have two different file systems --- plus mdadm and the overhead of three different kinds of software raid.

> I'm not sure why you'd need two different filesystems.

btrfs and zfs. I won't put my data on btrfs for at least quite a while.

> Just btrfs for your data. I'm not sure where you're counting three types of software raid either - you just have your swap.

btrfs raid is software raid, zfs raid is software raid, and mdadm is software raid. That makes three different software raids.

> And I don't think any of this involves any significant overhead, other than configuration.

mdadm does have a very significant performance overhead, and ZFS mirror performance seems to be rather poor. I don't know how much overhead is involved with zfs and btrfs software raid, yet since they basically all do the same thing, I doubt that it is significantly lower than the overhead of mdadm.

>> How would it be so much better to triple the software raids and to still have the same number of file systems?

> Well, the difference would be more data integrity insofar as hardware failure goes, but certainly more risk of logical errors (IMO).

There would be a possibility of more data integrity for the root file system, assuming that btrfs is as reliable as ext4 on hardware raid. Is it? That's about 10GB, mostly read and rarely written to. It would be a very minor improvement, if any.

>>>> When you use hardware raid, it can be disadvantageous compared to btrfs-raid --- and when you use it anyway, things are suddenly much more straightforward because everything is on raid to begin with.

>>> I'd stick with mdadm. You're never going to run mixed btrfs/hardware-raid on a single drive,

>> A single disk doesn't make for a raid.

> You misunderstood my statement. If you have two drives, you can't run both hardware raid and btrfs raid across them. Hardware raid setups don't generally support running across only part of a drive, and in this setup you'd have to run hardware raid on part of each of two single drives.

I have two drives to hold the root file system and the swap space. The raid controller they'd be connected to does not support using disks partially.
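Just so we're talking about the same thing, the proposed software setup would look roughly like this. A sketch only, with made-up partition names (sdX2 for root, sdX3 for swap):

  # btrfs raid1 across bare partitions for the root file system
  mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb2

  # mdadm raid1 underneath the swap space, so a single dead disk doesn't panic the box
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
  mkswap /dev/md0
  swapon /dev/md0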
>>> and the only time I'd consider hardware raid is with a high quality raid card. You'd still have to convince me not to use mdadm even if I had one of those lying around.

>> From my own experience, I can tell you that mdadm already does have significant overhead when you use a raid1 of two disks and a raid5 with three disks. This overhead may be somewhat due to the SATA controller not being as capable as one would expect --- yet that doesn't matter because one thing you're looking at, besides reliability, is the overall performance. And the overall performance very noticeably increased when I migrated from mdadm raids to hardware raids, with the same disks and the same hardware, except that the raid card was added.

> Well, sure, the raid card probably had battery-backed cache if it was decent, so linux could complete its commits to RAM and not have to wait for the disks.

Yes.

>> And that was only 5 disks. I also know that the performance with a ZFS mirror with two disks was disappointingly poor. Those disks aren't exactly fast, but still. I haven't tested yet if it changed after adding 4 mirrored disks to the pool. And I know that the performance of another hardware raid5 with 6 disks was very good.

> You're probably going to find the performance of a COW filesystem to be inferior to that of an overwrite-in-place filesystem, simply because the latter has to do less work.

Reading isn't as fast as I would expect, either.

>> Thus I'm not convinced that software raid is the way to go. I wish they would make hardware ZFS (or btrfs, if it ever becomes reliable) controllers.

> I doubt it would perform any better. What would that controller do that your CPU wouldn't do?

The CPU wouldn't need to do what the controller does and would have time to do other things instead.

> Well, other than have battery-backed cache, which would help in any circumstance. If you stuck 5 raid cards in your PC and put one drive on each card and put mdadm or ZFS across all five it would almost certainly perform better because you're adding battery-backed cache.

It's probably not only that. A 512MB cache probably doesn't make that much difference. I'm guessing that the SATA controller might be overwhelmed when it has to handle 5 disks simultaneously, while the hardware raid controller is designed to handle up to 256 disks simultaneously and thus does a much better job with a couple of disks, taking the load off the rest of the system. In the end, it doesn't really matter what exactly causes the difference in performance. What matters is that the performance is so much better.

>> The relevant advantage of btrfs is being able to make snapshots. Is that worth all the (potential) trouble? Snapshots are worthless when the file system destroys them with the rest of the data.

> And that is why I wouldn't use btrfs on a production system unless the use case mitigated this risk and there was benefit from the snapshots. Of course you're taking on more risk using an experimental filesystem.

Yes, and I'd have other disadvantages. I've come to think that being able to make snapshots isn't worth all the trouble.

>>> btrfs does not support swap files at present.

>> What happens when you try it?

> No idea. Should be easy to test in a VM. I suspect either an error or a kernel bug/panic/etc.

If it's that bad, it doesn't sound like a file system that's ready to be used yet.

>>> When it does you'll need to disable COW for them (using chattr) otherwise they'll be fragmented until your system grinds to a halt. A swap file is about the worst case scenario for any COW filesystem - I'm not sure how ZFS handles them.

>> Well, then they need to make special provisions for swap files in btrfs so that we can finally get rid of the swap partitions.

> I'm sure they'll happily accept patches. :)

I'm sure they won't. Everyone says they appreciate contributions, bug reports and patches, while in practice they make it more or less impossible to contribute, show no interest in contributions, ignore bug reports or close them automatically and prematurely, and decline any patches you may have ventured to provide. You'd be misguided to think that anyone cares about or wants your contribution; if you make one, you're making it only for yourself. I don't even file bug reports anymore because it's useless.
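Coming back to the chattr point: the recipe would presumably look something like the following once btrfs actually accepts swap files (a sketch only; /swapfile is a made-up path, and the +C attribute has to be set while the file is still empty):

  # create the file empty, then mark it no-COW before writing any data
  touch /swapfile
  chattr +C /swapfile
  dd if=/dev/zero of=/swapfile bs=1M count=2048   # 2 GiB; swap files must not be sparse
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile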
>>> If I had done that in the past I think I would have completely avoided that issue that required me to restore from backups. That happened in the 3.15/3.16 timeframe and I'd have never even run those kernels. They were stable kernels at the time, and a few versions in when I switched to them (I was probably just following gentoo-sources stable keywords back then), but they still had regressions (fixes were eventually backported).

>> How do you know if an old kernel you pick because you think the btrfs part works well enough is the right pick? You can either encounter a bug that has been fixed or a regression that hasn't been discovered/fixed yet. That way, you can't win.

> You read the lists closely. If you want to be bleeding-edge it will take more work than if you just go with the flow. That's why I'm not on 4.1 yet - I read the lists and am not quite sure they're ready yet.

That sounds like a lot of work. You really seem to be going to great lengths to use btrfs.

>>> I think btrfs is certainly usable today, though I'd be hesitant to run it on production servers depending on the use case (I'd be looking for a use case that actually has a significant benefit from using btrfs, and which somehow mitigates the risks).

>> There you go, it's usable, and the risk of using it is too high.

> That is a judgement that everybody has to make based on their requirements. The important thing is to make an informed decision. I don't get paid if you pick btrfs.

Being more informed doesn't magically result in better decisions. Information, like knowledge, is volatile and fluid; software is power, while making decisions is only a freedom.

>>> Right now I keep a daily rsnapshot (rsync on steroids - it's in the Gentoo repo) backup of my btrfs filesystems on ext4. I occasionally debate whether I still need it, but I sleep better knowing I have it. This is in addition to my daily duplicity cloud backups of my most important data (so, /etc and /home are in the cloud, and mythtv's /var/video is just on a local rsync backup).

>> I wouldn't give my data out of my hands.

> Somehow I doubt the folks at Amazon are going to break RSA anytime soon.

Which means?

>> Snapper? I've never heard of that ...

> http://snapper.io/
>
> Basically snapshots+crontab and some wrappers to set retention policies and such. That and some things like package-manager plugins so that you get snapshots before you install stuff.

Does this make things easier or more complicated? For example, I fail to understand what's supposed to be so great about zfs incremental snapshots for backups. Apparently you'd have to pile up an indefinite number of snapshots so you can keep incrementing them indefinitely, and it gets extremely scary when you want to remove them to get back to something sane.

>> Queuing up the data when there's more data than the system can deal with only works when the system has sufficient time to catch up with the queue. Otherwise, you have to block something at some point, or you must drop the data. At that point, it doesn't matter how you arrange the contents of the queue within it.

> Absolutely true. You need to throttle the data before it gets into the queue, so that the busyness of the queue is exposed to the applications and they can behave appropriately (falling back to lower-bandwidth alternatives, etc). In my case, if mythtv's write buffers are filling up and I'm also running an emerge install phase, the correct answer (per ionice) is for emerge to block so that my realtime video capture buffers are safely flushed. What you don't want is for the kernel to let emerge dump a few GB of low-priority data into the write cache alongside my 5Mbps HD recording stream. Granted, it isn't as big a problem as it used to be now that RAM sizes have increased.
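Mechanically, that part is easy enough; something like the line below does it, with a made-up package name (and if I remember right, Portage also has a PORTAGE_IONICE_COMMAND setting in make.conf for the same purpose):

  # run the build in the idle I/O class (only really meaningful with the CFQ scheduler),
  # so it only gets disk time that nobody else wants
  ionice -c 3 nice -n 19 emerge --oneshot some-package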
You could re-arrange the queue, and when it's long enough, you don't need to freeze anything. But what does, for example, a web browser do when it cannot receive data as fast as it can display it, or what does a VOIP application do when it cannot send data as fast as it wants to? I don't want my web browser to freeze, and a speaker whose voice is being transmitted over a network cannot be frozen mid-sentence to give the queue time to drain.

>> Gentoo /is/ fire-and-forget in that it works fine. Btrfs is not, in that it may work or not.

> Well, we certainly must have come a long way then. :) I still remember the last time the glibc ABI changed and I was basically rebuilding everything from single-user mode holding my breath.

Did it work?

