Re: btrfs und lvm-cache?
On Wednesday, 23 December 2015 at 11:45:28 CET, Neuer User wrote:
> Hello

Hi.

> I want to setup a small homeserver, based on a HP Microserver Gen8 (4GB RAM, 2x3TB HDD + 1x120GB SSD) and Proxmox as distro.

> The server will be used to host a (small) number of virtual machines, most of them being LXC containers, a few being KVM machines. One of the LXC containers will host a fileserver with approx. 1 TB of data and another one a backup system for the desktops / laptops in my household, thus probably holding quite a lot of files. The LXC containers will use the filesystem of the Proxmox host, the KVM machines probably raw disk files (or qcow2).

> I would like to combine high data integrity with some speed, so I thought of the following layout:
> - both HDD and SSD in one LVM VG
> - one LV on each HDD, containing a btrfs filesystem
> - both btrfs LVs configured as RAID1
> - the single SSD used as an LVM cache device for both HDD LVs to speed up random access, where possible

> Now, I wonder if that is a good architecture to go for. Any input on that? Is btrfs the right way to go, or should I better go for ZFS (and purchase some more gigs of RAM)?

> Will there be any problems arising from the lvmcache? btrfs only sees the HDDs, LVM does the SSD handling.

As far as I understand it, this way you basically lose the RAID 1 semantics of BTRFS. While the data is redundant on the HDDs, it is not redundant on the SSD. It may work for a pure read cache, but for write-through you definitely lose any data integrity protection a RAID 1 gives you.

Of course, you can use two SSDs and have them work as RAID 1 as well.

There is a patch set for in-BTRFS SSD caching. It consists of a patch set to add hot data tracking to VFS and a patch set for adding support in BTRFS. But I didn't see anything of these in quite some time.
Happy Christmas,
-- 
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs und lvm-cache?
On 23.12.2015 at 12:21, Martin Steigerwald wrote:
> Hi.

> As far as I understand it, this way you basically lose the RAID 1 semantics of BTRFS. While the data is redundant on the HDDs, it is not redundant on the SSD. It may work for a pure read cache, but for write-through you definitely lose any data integrity protection a RAID 1 gives you.

Hmm, are you sure? I thought LVM lies underneath btrfs. Btrfs thus should not know about the caching SSD at all. It only knows of the two LVs on the HDDs, reading and writing data from or to one or both of the two LVs.

Only then does lvmcache decide if it reads the data from the underlying HDD or from the cache SSD. LVM shouldn't even know that the two LVs are configured as RAID1 by btrfs, as this is a level higher. So for LVM the two LVs are different data, both of which would need to be cached independently on the SSD.

What might happen, though, is that there is data loss on the SSD, returning a mismatching checksum, so btrfs might think that the data is incorrect on one LV (= HDD), although it is indeed correct there. That would lead btrfs to read the data from the second LV (which might also be in the SSD cache or not) and then update the (correct and identical) data of the first LV with it. Or do I see that wrong?

> Of course, you can use two SSDs and have them work as RAID 1 as well.

> There is a patch set for in-BTRFS SSD caching. It consists of a patch set to add hot data tracking to VFS and a patch set for adding support in BTRFS. But I didn't see anything of these in quite some time.

That would be interesting, but for my project it's probably too late.

> Happy Christmas,

Yeah, happy Christmas to you and everybody on the list.

Michael
Re: Loss of connection to Half of the drives
On Tue, Dec 22, 2015 at 10:13 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Donald Pearson posted on Tue, 22 Dec 2015 17:56:29 -0600 as excerpted:

>>> Also understand with Btrfs RAID 10 you can't lose more than 1 drive reliably. It's not like a strict raid1+0 where you can lose all of the "copy 1" *OR* "copy 2" mirrors.

>> Pardon my pea brain but this sounds like a pretty bad design flaw?

> It's not a design flaw, it's EUNIMPLEMENTED. Btrfs raid1, unlike say mdraid1 (and now various hardware raid vendors), implements exactly two-copy raid1 -- each chunk is mirrored to exactly two devices. And btrfs raid10, because it builds on btrfs raid1, is likewise exactly two copies.

> With raid1 on two devices, where those two copies go is defined, one to each device. With raid1 on more than two devices, the current chunk-allocator will allocate one copy each to the two devices with the most free space left, so that if the devices are all the same size, they'll all be used to about the same level and will run out of space at about the same time. (If they're not the same size, with one much larger than the others, it'll get one copy all the time, with the other copy going to the second largest or to each in turn once remaining empty sizes even out.)

> Similarly with raid10, except each strip is two-way mirrored and a stripe created of the mirrors.

> And because the raid is managed and allocated per-chunk, drop more than a single device, and it's very likely you _will_ be dropping both copies of _some_ chunks on raid1, and some strips of chunks on raid10, making them entirely unavailable.

> In that case you _might_ be able to mount degraded,ro, but you won't be able to mount writable.

> The other btrfs-only alternative at this point would be btrfs raid6, which should let you drop TWO devices before data is simply missing and unrecreatable from parity.
> But btrfs raid6 is far newer and less mature than either raid1 or raid10, and running the truly latest versions is very strongly recommended up to v4.4 or so, which is actually soon to be released now, as older versions WILL quite likely have issues. As it happens, kernel v4.4 is an LTS series, so the timing for btrfs raid5 and raid6 there is quite nice, as 4.4 should see them finally reasonably stable, and being LTS, should continue to be supported for quite some time.

> (The current btrfs list recommendation in general is to stay within two LTS versions in order to avoid getting /too/ far behind, as while stabilizing, btrfs isn't entirely stable and mature yet, and further back than that simply gets unrealistic to support very well. That's 3.18 and 4.1 currently, with 3.18 being soon to drop as 4.4 is soon to release as the next LTS. But as btrfs stabilizes further, it's somewhat likely that 4.1, or at least 4.4, will continue to be reasonably supported beyond the second-LTS-back phase, perhaps to the third, and sometime after that, support will probably last more or less as long as the LTS stable branch continues getting updates.)

> But even btrfs raid6 only lets you drop two devices before general data loss occurs.

> The other alternative, as regularly used and recommended by one regular poster here, would be btrfs raid1 on top of mdraid0 or possibly mdraid10 or whatever. The same general principle would apply to btrfs raid5 and raid6 as they mature, on top of mdraidN, with the important point being that the btrfs level has redundancy, raid1/10/5/6, since it has real-time data and metadata checksumming and integrity management features that are lacking in mdraid. By putting the btrfs raid with either redundancy or parity on top, you get the benefit of actual error recovery that would be lacking if it was btrfs raid0 on top.
> That would let you manage loss of one entire set of the underlying mdraid devices, one copy of the overlying btrfs raid1/10 or one strip/parity of btrfs raid5, which could then be rebuilt from the other two, while maintaining btrfs data and metadata integrity as one copy (or stripe-minus-one-plus-one-parity) would always exist. With btrfs raid6, it would of course let you lose two of the underlying sets of devices composing the btrfs raid6.

> In the precise scenario the OP posted, that would work well, since in the huge-numbers-of-devices-going-offline case, it'd always be complete sets of devices, corresponding to one of the underlying mdraidNs, because the scenario is that set getting unplugged or whatever.

> Of course in the more general random-N-devices-going-offline case, with the N devices coming from any of the underlying mdraidNs, it could still result in not all data being available to the btrfs raid level, but except for mdraid0, the chances of it happening are still relatively low, and with mdraid0, they're still within reason, if not /as/ low. But that general scenario isn't
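[Editor's note: Duncan's description of the raid1 chunk allocator (one copy of each chunk goes to each of the two devices with the most free space left) can be modeled in a few lines of Python. This is a toy sketch of the allocation rule only, not btrfs code; the function name and chunk-size parameter are my own.]

```python
# Toy model of the btrfs raid1 chunk allocator described above:
# each chunk's two copies go to the two devices with the most free space.
# Illustrative only -- not actual btrfs code.

def allocate_chunks(device_sizes, chunk_size):
    """Return per-device free space after allocating as many two-copy
    chunks as fit, always picking the two devices with the most free
    space to hold the two copies."""
    free = list(device_sizes)
    while True:
        # Indices of the two devices with the most free space left.
        a, b = sorted(range(len(free)), key=lambda i: free[i], reverse=True)[:2]
        if free[a] < chunk_size or free[b] < chunk_size:
            break  # no pair of devices can hold both copies anymore
        free[a] -= chunk_size
        free[b] -= chunk_size
    return free

# Equal-sized devices fill evenly and run out together:
print(allocate_chunks([10, 10, 10], 1))  # -> [0, 0, 0]
# A much larger device gets one copy of every chunk, as Duncan notes,
# and is left with unusable space once the smaller devices are full:
print(allocate_chunks([30, 10, 10], 1))  # -> [10, 0, 0]
```

This also shows why dropping more than one device almost certainly loses some chunk entirely: the pairs rotate over all devices, so every device holds one copy of some chunks.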
Re: Loss of connection to Half of the drives
On 2015-12-23 16:53, Donald Pearson wrote:
[...]
> Additionally real Raid10 will run circles around what BTRFS is doing in terms of performance. In the 20 drive array you're striping across 10 drives, in BTRFS right now you're striping across 2 no matter what. So not only do I lose in terms of resilience, I lose in terms of performance. I assume that N-way mirroring used with BTRFS Raid10 will also increase the stripe width so that will level out the performance, but you're always going to be short a drive for equal resilience.

In case of RAID10, to the best of my knowledge, BTRFS allocates each CHUNK across *all* the available devices. It uses the usual RAID0 (== striping) over a RAID1 (mirroring).

What you are describing is BTRFS RAID1, i.e. LINEAR over a RAID1: each chunk is allocated on *two*, only *two*, different disks from the disk pool; the disks are the ones with the largest free space. Each chunk may be allocated on a different *pair* of disks.

> And finally the elephant in the room that comes with the necessary 11-way mirroring is the usable capacity of that 20 drive array. Remember, pea brain, so my math may be wrong in application and calculation, but if it's made of 1T drives for 20T raw, there is only 1.82T usable (20 / 11), and if I'm completely off in that figure the point is still that such a high level of mirroring is going to excessively consume drive space.

Duncan talked about an N-way mirroring where each disk contains a copy of the same data. Nobody talked about N-way mirroring where N is less than the number of the available disks.

To be honest, in the past some patches appeared to implement a generalized RAID-NxM, where N is the total number of disks and M the number of redundancy disks: i.e. the filesystem could tolerate a drop of M disks (see http://www.spinics.net/lists/linux-btrfs/msg29245.html).
BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
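[Editor's note: Donald's back-of-the-envelope capacity figure above (20 x 1T drives with 11-way mirroring leaving about 1.82T usable) checks out. A trivial sketch, assuming identical drives and every chunk mirrored M ways; the helper name is my own.]

```python
# Usable capacity of an array where every chunk is mirrored M ways:
# raw capacity divided by the number of copies. Quick check of the
# "20 x 1T drives with 11-way mirroring" figure from the thread.

def usable_tb(num_drives, drive_tb, copies):
    return num_drives * drive_tb / copies

print(round(usable_tb(20, 1, 11), 2))  # 11-way mirroring -> 1.82 TB usable
print(usable_tb(20, 1, 2))             # two-copy raid1/raid10 -> 10.0 TB
```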
Re: btrfs und lvm-cache?
On Wed, Dec 23, 2015 at 4:38 AM, Neuer User wrote:
> On 23.12.2015 at 12:21, Martin Steigerwald wrote:
>> Hi.

>> As far as I understand it, this way you basically lose the RAID 1 semantics of BTRFS. While the data is redundant on the HDDs, it is not redundant on the SSD. It may work for a pure read cache, but for write-through you definitely lose any data integrity protection a RAID 1 gives you.

> Hmm, are you sure? I thought LVM lies underneath btrfs. Btrfs thus should not know about the caching SSD at all. It only knows of the two LVs on the HDDs, reading and writing data from or to one or both of the two LVs.

> Only then does lvmcache decide if it reads the data from the underlying HDD or from the cache SSD. LVM shouldn't even know that the two LVs are configured as RAID1 by btrfs, as this is a level higher. So for LVM the two LVs are different data, both of which would need to be cached independently on the SSD.

> What might happen, though, is that there is data loss on the SSD, returning a mismatching checksum, so btrfs might think that the data is incorrect on one LV (= HDD), although it is indeed correct there. That would lead btrfs to read the data from the second LV (which might also be in the SSD cache or not) and then update the (correct and identical) data of the first LV with it.

Seems to me if the LVs on the two HDDs are exposed, lvmcache has to separately keep track of those LVs. So as long as everything is working correctly, it should be fine. That includes either transient or persistent, but consistent, errors for either HDD or the SSD, and Btrfs can fix up those bad reads with data from the other. If the SSD were to decide to go nutty, chances are reads through lvmcache would be corrupt no matter what LV is being read by Btrfs, and it'll be aware of that and discard those reads.
Any corrupt writes in this case won't be immediately known by Btrfs, because it (like any file system) assumes writes are OK unless the device reports a write failure, but those too would be found on read.

The question I have, that I don't know the answer to, is if the stack arrives at a point where all writes are corrupt but hardware isn't reporting write errors, and it continues to happen for a while: once you've resolved that problem and try to mount the file system again, how well does Btrfs disregard all those bad writes? How well would any filesystem?

-- 
Chris Murphy
Re: btrfs und lvm-cache?
One other thing: I read that btrfs has some options that are turned off for SSDs as they might be harmful or so. In my case btrfs, however, would not know about the SSD and probably use its HDD-optimized settings. The result, however, would be forwarded also to the SSD via lvmcache. Do I see that right? Would that give any serious problems?

On 23.12.2015 at 11:45, Neuer User wrote:
> Hello

> I want to setup a small homeserver, based on a HP Microserver Gen8 (4GB RAM, 2x3TB HDD + 1x120GB SSD) and Proxmox as distro.

> The server will be used to host a (small) number of virtual machines, most of them being LXC containers, a few being KVM machines. One of the LXC containers will host a fileserver with approx. 1 TB of data and another one a backup system for the desktops / laptops in my household, thus probably holding quite a lot of files. The LXC containers will use the filesystem of the Proxmox host, the KVM machines probably raw disk files (or qcow2).

> I would like to combine high data integrity with some speed, so I thought of the following layout:
> - both HDD and SSD in one LVM VG
> - one LV on each HDD, containing a btrfs filesystem
> - both btrfs LVs configured as RAID1
> - the single SSD used as an LVM cache device for both HDD LVs to speed up random access, where possible

> Now, I wonder if that is a good architecture to go for. Any input on that? Is btrfs the right way to go, or should I better go for ZFS (and purchase some more gigs of RAM)?

> Will there be any problems arising from the lvmcache? btrfs only sees the HDDs, LVM does the SSD handling.

> Thanks for any input. I like btrfs very much, but data integrity is important for this.
> Michael
Re: btrfs und lvm-cache?
On Wed, Dec 23, 2015 at 1:21 PM, Neuer User wrote:
> On 23.12.2015 at 20:49, Chris Murphy wrote:
>> Seems to me if the LVs on the two HDDs are exposed, lvmcache has to separately keep track of those LVs. So as long as everything is working correctly, it should be fine. That includes either transient or persistent, but consistent, errors for either HDD or the SSD, and Btrfs can fix up those bad reads with data from the other. If the SSD were to decide to go nutty, chances are reads through lvmcache would be corrupt no matter what LV is being read by Btrfs, and it'll be aware of that and discard those reads. Any corrupt writes in this case won't be immediately known by Btrfs because it (like any file system) assumes writes are OK unless the device reports a write failure, but those too would be found on read.

> What corrupt write do you mean? The "nuts" SSD is not going to write to the HDDs, that will be done by lvmcache. So the HDDs should get the correct data, only the SSD will be bad, right?

Btrfs always writes to the 'cache LV' and then it's up to lvmcache to determine how and when things are written to the 'cache pool LV' vs the 'origin LV', and I have no idea if there's a case with writeback mode where things write to the SSD and only later get copied from SSD to the HDD, in which case a wildly misbehaving SSD might corrupt data on the origin.

If you use writethrough, the default, then the data on the HDDs should be fine even if the single SSD goes crazy for some reason. Even if all reads go bad, the worst case is Btrfs should stop and go read-only. If the SSD read errors are more transient, then Btrfs tries to fix them with COW writes, so even if these fixes aren't needed on the HDD, they should arrive safely on both HDDs and hence still no corruption.

I mean, *really*, if data integrity is paramount you probably would do this with production methods.
Anything that has high IOPS like a mail server, just write that stuff only to the SSD, and then occasionally rsync it to conventionally raided (md or lvm) HDDs with XFS. You could even use lvm snapshots and do this often, and now you not only have something fast and safe but also an integrated backup that's mirrored; in a sense you have three copies.

Whereas what you're attempting is rather complicated, and while it ought to work and it gets testing, you're really being a test candidate, not least of which is Btrfs but also lvmcache, and you're combining both tests. I'd just say make sure you have regular backups - snapshot the rw subvolume regularly and sync it to another filesystem. As often as the workflow can tolerate.

> And that would become obvious with the next reads, in which case btrfs probably would throw an error as it gets crazy data from apparently both LVs (but only coming from the SSD). So, that could be fixed by removing the SSD without any data loss from the HDDs, right?

Only if you're using writethrough mode, but yes.

>> The question I have, that I don't know the answer to, is if the stack arrives at a point where all writes are corrupt but hardware isn't reporting write errors, and it continues to happen for a while, once you've resolved that problem and try to mount the file system again, how well does Btrfs disregard all those bad writes? How well would any filesystem?

> Hmm, again the writes to the HDDs should be ok. Only the SSD would have pretty corrupt data, right? In such a case it might depend on how much bad data is read back from the SSD and what the filesystem does in reaction to it?

> P.S.: Of course, one other possibility would be to use a second SSD, so that each LV has a separate caching SSD. In this case, there would always be a valid source (given that not both SSDs go nuts at the same time...).
Simplistically, SSDs seem to fail two ways: a series of transient errors that Btrfs can pretty much always account for; and then totally face planting. The way they faceplant can be all writes fail, reads work, or the whole device just vanishes off the bus. I don't know how that affects lvmcache writethrough if the entire cache pool vanishes. It should still write to the HDDs but I don't know that it does.

> But I would need another slot for this. If the pros are very high, that's ok. If it works nicely with just one SSD, then even better.

Yeah, if it's a decent name-brand SSD and not one of the ones with known crap firmware, then I think it's fine to just have one. Either way, each origin LV gets a separate cache pool LV, if I understand lvmcache correctly. Off hand I don't know if you need separate VGs to make sure the 'cache LVs' you format with Btrfs in fact use different PVs as origins. That's important. The usual lvcreate command has a way to specify one or more PVs to use, rather than have it just grab a pile of extents from the VG (which could be from either PV), but I don't know if that's the way
Re: [PATCH 1/2] fstests: fix btrfs test failures after commit 27d077ec0bda
On Tue, Dec 22, 2015 at 02:22:40AM +, fdman...@kernel.org wrote:
> From: Filipe Manana

> Commit 27d077ec0bda (common: use mount/umount helpers everywhere) made a few btrfs tests fail for 2 different reasons:

> 1) Some tests (btrfs/029 and btrfs/031) use $SCRATCH_MNT as a mount point for some subvolume created in $TEST_DEV, therefore calling _scratch_unmount does not work as it passes $SCRATCH_DEV as the argument to the umount program. This is intentional to test reflinks across different mountpoints of the same filesystem but for different subvolumes;

The correct way to fix this is to stop abusing $SCRATCH_MNT and to instead use a local mount point on the test device.

> 2) For multiple device filesystems (btrfs/003 and btrfs/011) that test the device replace feature, we need to unmount using the mount path ($SCRATCH_MNT) because unmounting using one of the devices as an argument ($SCRATCH_DEV) does not always work - after replace operations we get in /proc/mounts a device other than $SCRATCH_DEV associated with the mount point $SCRATCH_MNT (this is mentioned in a comment at btrfs/011 for example), so we need to pass that other device to the umount program or pass it the mount point.

Which says that _scratch_unmount should be using $SCRATCH_MNT rather than $SCRATCH_DEV. That would fix the problem without needing to modify any of the tests, right?

> Using $SCRATCH_MNT as a mountpoint for a device other than $SCRATCH_DEV is misleading, but that's a different problem that existed long before, and this change attempts only to fix the regression from 27d077ec0bda.

It may be misleading, but that's the fundamental problem that needs fixing. As always, we should be trying to fix the root cause of the problem, not working around it...

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: Loss of connection to Half of the drives
Goffredo Baroncelli posted on Wed, 23 Dec 2015 19:20:32 +0100 as excerpted:

> Duncan talked about an N-way mirroring, where each disk contains a copy of the same data. Nobody talked about N-way mirroring where N is less than the number of the available disks.

Well, to be fair, I did /try/ to talk about raid10 in the context of N-way-mirroring, as *one*future*option*, which would let you do say 3-way-mirroring, 2-way-striping, using six devices, giving you that choice in addition to the current 3-way-striping, 2-way-mirroring, that's the only current choice for btrfs raid10 with six devices, since it's limited to two-way-mirroring.

But obviously I was more confusing than clear, since you apparently didn't see that bit at all, and he saw it, but apparently ended up more confused than helped by it, possibly due to trying to apply that discussion to a larger scope than the limited one-future-option scope that I had originally intended.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Raid 5/6 Stability
There's a worthwhile distinction between stability of raid56 vs all other profiles, and btrfs multiple device failure behavior. Right now there's no monitoring or notification of failures to user space. In fact Btrfs itself doesn't really understand device failures; a device can spit out many read or write errors and Btrfs keeps trying to read and write. So there's no equivalent of md/mdadm's 'faulty' device state. Therefore you'll have to figure out a way to monitor kernel messages, maybe via a script that parses for btrfs messages and emails any such messages every 10m or whatever.

Chris Murphy.
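[Editor's note: a minimal sketch of the kind of watcher Chris suggests, assuming dmesg-style input. The error patterns and function names here are illustrative guesses, not an authoritative list of btrfs log messages; a real deployment would read the journal and send mail on a timer.]

```python
# Minimal sketch of a btrfs kernel-log watcher, as suggested above.
# Assumptions: log lines look like dmesg output; the patterns below
# are illustrative, not an exhaustive list of btrfs messages.
import re

ERROR_PATTERN = re.compile(
    r"BTRFS.*(error|warning|csum failed|corrupt)", re.IGNORECASE
)

def btrfs_alerts(log_lines):
    """Return the log lines that look like btrfs trouble."""
    return [line for line in log_lines if ERROR_PATTERN.search(line)]

sample = [
    "[ 12.3] BTRFS info (device sda2): disk space caching is enabled",
    "[ 99.1] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 5, rd 2",
    "[101.0] BTRFS warning (device sdb1): csum failed ino 257 off 4096",
    "[102.2] EXT4-fs (sdc1): mounted filesystem",
]
for line in btrfs_alerts(sample):
    print(line)  # a real script would mail these every 10 minutes
```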
Re: Raid 5/6 Stability
Chris Murphy posted on Wed, 23 Dec 2015 19:38:23 -0700 as excerpted:

> There's a worthwhile distinction between stability of raid56 vs all other profiles, and btrfs multiple device failure behavior. Right now there's no monitoring or notification of failures to user space. In fact Btrfs itself doesn't really understand device failures, a device can spit out many read or write errors and Btrfs keeps trying to read and write. So there's no equivalent to faultiness like with md/mdadm. Therefore you'll have to figure out a way to monitor kernel messages, maybe via a script that parses for btrfs messages and emails any such messages every 10m or whatever.

Absolutely. Raid56 mode may be stabilizing, but there's still no user-side multi-device filesystem health monitoring application, either for raid56 or in general, for the raid1/10 modes which are in fact reasonably stable and mature on btrfs and have been considered so at the level of btrfs itself for quite a while (several years) now.

Thanks for that addendum, Chris. It could be quite helpful to someone just setting up a new installation, particularly on a server where the user and/or admin is unlikely to be directly observing things and thus know when things go wrong due to the observed change in behavior, regardless of formal monitoring or the lack thereof, as would likely be the case on a desktop/workstation.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: btrfs und lvm-cache?
Neuer User posted on Wed, 23 Dec 2015 11:45:28 +0100 as excerpted:

> - both HDD and SSD in one LVM VG
> - one LV on each HDD, containing a btrfs filesystem
> - both btrfs LVs configured as RAID1
> - the single SSD used as an LVM cache device for both HDD LVs to speed up random access, where possible

I'll let others debate the lvm-cache details, which I don't know much about, but I do have a couple points to add, one of which is detail, one rather higher level. The higher level one first:

1) While I've seen both bcache and lvm-cache discussed as potential options here, there is at least one user running btrfs on top of bcache that posts to bcache-related threads here with some regularity. While there were some serious bugs to work thru early on, his recent posts suggest current bcache works very well with current btrfs, and given that he has posted to several threads with some time separation between them, he does appear to be a regular here, and I expect he'd be posting pretty fast if things started going buggy for him once again.

There hasn't been a corresponding regular poster here using lvm-cache, so while it may work well, we don't know that. At minimum, postings thus suggest that bcache with btrfs is a better tested solution at this point, and thus would be recommended, while lvm-cache with btrfs, while an equally valid technical choice in theory, doesn't have much if any real-world data going for it at this point, and is thus in practice an unknown.

2) Not being the person using bcache and not being familiar with it or lvm-cache personally, I don't know how either one handles btrfs multi-device. However, it occurs to me that if it's necessary, in addition to the multiple SSDs suggested by the others to cover such multi-device caching, you should also be able to partition up the SSD, and use each partition as an individual device cache.
That's almost certainly what I'd do here if I needed to (except that above a certain size, SSD prices per GiB start to go up dramatically, so if I wanted total SSD cache sizes above that I'd of course pay less for multiple smaller SSDs again) instead of fiddling with multiple physical SSDs, but again, not knowing how the caching works, I'm not sure if multiple cache devices would be needed to cache a multi-device btrfs at the back end, or not, so I don't know whether I'd need to bother with such partitioning or not.

The key here is that on SSDs, seek time is zero anyway, so partitioning up the SSD and using both partitions as cache doesn't have the latency issues that attempting to do something like that (or for example btrfs raid1 on two partitions on the same physical device) would have on spinning rust.

I thought I'd throw those points out, in case you had failed to notice bcache as an option and would prefer it as better tested, once you knew about it, and in case the partitioned-SSD idea does help with the multi-device btrfs caching thing.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Loss of connection to Half of the drives
Donald Pearson posted on Wed, 23 Dec 2015 09:53:41 -0600 as excerpted:

> Additionally real Raid10 will run circles around what BTRFS is doing in terms of performance. In the 20 drive array you're striping across 10 drives, in BTRFS right now you're striping across 2 no matter what. So not only do I lose in terms of resilience, I lose in terms of performance. I assume that N-way mirroring used with BTRFS Raid10 will also increase the stripe width so that will level out the performance, but you're always going to be short a drive for equal resilience.

No, with btrfs raid10, you're /mirroring/ across two drives no matter what. With 20 devices, you're /striping/ across 10 two-way mirrors. It's the same as a standard raid10, in that regard.

Tho it's a bit different in that the mix of devices forming the above can differ among different chunks. IOW, the first chunk might be mirrored a/b c/d e/f g/h i/j k/l m/n o/p q/r s/t, with the stripe across each mirror-pair, but the next chunk might be mirrored a/l g/o f/k b/n c/d e/s j/q h/t i/p m/r (I think I got each letter once...), and striped across those pairs.

So you get the same performance as a normal raid10 (well, to the extent that btrfs has been optimized, which in large part it hasn't been, yet), but as should always be the case in a raid10, randomized loss of more than a single device can mean data loss.

But, because each chunk pair assignment is more or less randomized, unlike a conventional raid10 which lets you map all of one mirror set to one cabinet and all of the second mirror set to another cabinet, so you can reliably lose an entire cabinet and be fine since it's known to correspond exactly to a single mirror set, you can't do that with btrfs raid10, because there's no way to specify individual chunk mirroring, and what might be precisely one mirror set with one chunk is very likely to be both copies of some mirrors and no copies of other mirrors with another chunk.
What I was suggesting as a solution was a setup that:

(a) has btrfs raid1 at the top level,
(b) has a pair of mdraid0s underneath, in this case a pair of 10-device mdraid0s, and
(c) presents each mdraid0 to btrfs as one of its raid1 mirrors.

While this is actually raid01, not raid10, in this case it makes more sense than a mixed raid10, because by doing it that way, you'd:

1) keep btrfs' data integrity and error correction at the top level, as it could pull from the second copy if the first failed checksum.

2) be able to stick each mdraid0 in its own cabinet, so loss of the entire cabinet wouldn't be data loss, only redundancy loss.

(Reversing that, btrfs raid0 on top of mdraid1, would lose btrfs' ability to correct checksum errors, as at the btrfs level it'd be non-redundant, and mdraid1 doesn't have checksumming, so it couldn't provide the same data integrity service. Without checksumming to tell which copy is good, you could scrub the mdraid1 to make its mirrors identical again, but you'd be just as likely to copy the bad one over the good one as the reverse. Thus, btrfs really needs to be the raid1 layer unless you simply don't care about data integrity, and because btrfs is the filesystem layer, it has to be the top layer, so you're left doing a raid01 instead of the raid10 that's ordinarily preferred due to locality of a rebuild, absent other factors like this data integrity factor.)

And what btrfs N-way-mirroring will provide, in the longer term once btrfs gets that feature and it stabilizes to usability, is the ability to actually have three cabinets and sustain the loss of two, or four cabinets and sustain the loss of three, etc.

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." 
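A rough sketch of that stack, assuming 20 drives split across two cabinets (all device names below are placeholders, not from the original post, and this can't be run without the actual hardware):

```shell
# Cabinet 1: a 10-device stripe; Cabinet 2: another 10-device stripe
mdadm --create /dev/md0 --level=0 --raid-devices=10 /dev/sd[a-j]
mdadm --create /dev/md1 --level=0 --raid-devices=10 /dev/sd[k-t]

# btrfs raid1 on top: each mdraid0 is one of the two btrfs mirrors,
# so checksumming and error correction stay with btrfs at the top level
mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
mount /dev/md0 /mnt    # mounting either member device mounts the whole fs
```

Losing all of /dev/md1 (one cabinet) then leaves btrfs degraded but with a complete checksummed copy on /dev/md0.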
Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raid 5/6 Stability
jwalmer posted on Wed, 23 Dec 2015 17:52:10 -0500 as excerpted:

> Just an avid follower of the project checking in. It has been about nine
> months since the initial Raid 5/6 features were released in 3.19 and
> they are still listed as incomplete/experimental on the Wiki.
>
> Admittedly, I don't understand how such a large and distributed project
> prioritizes features for development, but I haven't been able to find a
> clear roadmap anywhere.
>
> I'm wondering if anyone here is able to give me some insight about when
> the Raid 5/6 feature will next be updated, or even when they are
> scheduled to lose their incomplete/experimental designation.

Addressing the wiki side first, then the question you're probably more interested in. =:^)

FWIW, the wiki gets updated... when a volunteer (which could be you =:^) updates it. It often has quite current information... somewhere on the wiki, but often not all mentions of a feature get updated at the same time, and some may lag behind.

That said, while btrfs raid56 is no longer experimental, I'd not call it entirely stable, even to the point of the rest of btrfs (which is stabilizing but not fully stable or mature yet), just yet.

I've personally long stated that raid56 feature stability, to the point of the rest of btrfs anyway, can be expected roughly a year after nominal feature completion, with an additional requirement of at least two kernel cycles without major bugs in the feature. At five kernel releases a year that would put it more or less at 4.4, which is soon to be released and quite good timing, as 4.4 is an LTS release. And indeed, the last major raid56 bug was fixed early in the 4.2 cycle (well before 4.2 release), so 4.4 meets the requirement in that regard as well. 
=:^)

Now I'm just an active list regular and btrfs user, not a dev, but I began making that recommendation/prediction before 3.19's release, when it was clear 3.19 would bring nominal raid56 code completion, and in the immediately following releases as well, when people were (I thought) jumping the gun, and indeed, getting their data eaten by remaining critical bugs. Nobody has argued otherwise in the intervening time, so I'd suggest it's a reasonably solid recommendation.

So 4.4 is what I'd consider the magical raid56-stability release, and I'd actually expect the wiki to be updated shortly thereafter. Tho 4.4 is close enough now, and there have been no major raid56 bugs reported in the 4.3 and 4.4 cycles, that arguably the wiki's raid56 status could be updated now to reflect that.

(Personally, I'm more a newsgroups and mailing lists guy, and while I read web/wiki resources and will in fact often quote them, I tend to treat them as read-only and very seldom personally edit them, leaving that to others, who occasionally even quote my list posts more or less verbatim when they update the wiki. So again, you're invited to do so if that's your thing, but it's nothing I'm likely to do personally. And FWIW, there are a few folks that watch wiki updates and revert spam and anything crazy, so as long as the edits are honestly trying to make things better, any help you can be in editing the wiki is highly appreciated, and you don't have to worry too much about any mistakes you inadvertently make, as others will be along to fix them.) =:^)

-- Duncan - List replies preferred. No HTML msgs.
Re: btrfs und lvm-cache?
On 12/23/15 21:07, Neuer User wrote:

> Understood. However, do SSDs really do automatic deduplication? I might
> be completely wrong here, but that sounds to be a rather complex
> mechanism, requiring lots of RAM to deduplicate 100 GB. I wouldn't have
> thought that typical SSDs include that?

tl;dr: no, because delta encoding/write buffer coalescing is not dedupe.

This is one of those persistent myths that has been kept alive by the internet rumor machine. It has its roots in a series of blog articles [1] and turned out to be panic coupled with FUD and fueled by a lack of factual information. I suggest everyone read the article(s), ALL the comments and then get back to drinking. :o)

In SSD arrays dedupe is generally seen as a good thing.

-h

[1] http://storagemojo.com/2011/06/27/de-dup-too-much-of-good-thing/
Re: Loss of connection to Half of the drives
On Wed, Dec 23, 2015 at 3:15 PM, Donald Pearson wrote:
> On Wed, Dec 23, 2015 at 12:20 PM, Goffredo Baroncelli wrote:
>> Duncan talked about N-way mirroring where each disk contains a copy of
>> the same data. Nobody talked about N-way mirroring where N is less than
>> the number of the available disks.
>
> Well that was certainly implied as the unimplemented solution to
> dropping half the drives that the OP tested. N-way mirroring where N
> = the number of drives is just Raid1 on crack and not the Raid10
> use-case that the OP is asking about.

How does the OP's use case normally get implemented? For separate controllers, this would need to be software raid10, but you'd need a way to specify the drive pairings. How does mdadm create -l raid10 enable that? Or, to make absolutely certain, do you put them all in a container and then first create -l raid1, and then second create -l raid0?

In any case, what you get is drive-level granularity for mirroring. A drive has an exact (excluding layout options, but still data exact) copy. That's not true with Btrfs, where the granularity is the data chunk (1+GiB). A given drive's chunks will definitely have copies on multiple drives rather than on a single drive. And those multiple drives will variably be on both sides of a controller or drive make/model division.

One of the major differences of Btrfs with all profiles is that it deals with different sized devices elegantly. That's because of the chunk level granularity. So I think that having mirrors of drives rather than chunks means that we have to have exact size drive pairings.

-- Chris Murphy
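On the mdadm pairing question: my understanding is that with mdadm's default "near" (n2) raid10 layout, the two copies of each stripe chunk go to adjacent devices in the order listed on the command line, so you can control pairings by alternating cabinets in the device list. A hedged sketch with hypothetical device names (cabA* in one cabinet, cabB* in the other):

```shell
# With the near-2 layout, copies land on adjacent devices in the given
# order, so alternating the list pairs each cabinet-A drive with a
# cabinet-B drive: (cabA1,cabB1) and (cabA2,cabB2) become mirror pairs.
mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=4 \
      /dev/cabA1 /dev/cabB1 /dev/cabA2 /dev/cabB2
```

That gives drive-level pairing without the raid1-in-raid0 container dance, though it still lacks btrfs' checksumming, as discussed above.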
Re: Ideas on unified real-ro mount option across all filesystems
Eric Sandeen writes:
>> 3) A lot of users even don't know mount ro can still modify the device.
>>    Yes, I didn't know this point until I checked the log replay code of
>>    btrfs. Adding such a mount option alias may raise some attention of
>>    users.
>
> Given that nothing in the documentation implies that the block device itself
> must remain unchanged on a read-only mount, I don't see any problem which
> needs fixing. MS_RDONLY rejects user IO; that's all.
>
> If you want to be sure your block device rejects all IO for forensics or
> what have you, I'd suggest # blockdev --setro /dev/whatever prior to mount,
> and take it out of the filesystem's control. Or better yet, making an
> image and not touching the original.

What we do for the petitboot bootloader in POWER and OpenPower firmware (a linux+initramfs that does kexec to boot) is use device mapper to make a snapshot in memory where we run recovery (for some filesystems; notably XFS is different due to the journal not being endian safe). We also have to have an option *not* to do that, just in case there's a bug in journal replay... and we're lucky in that we probably do have enough memory to complete replay; this solution could be completely impossible on lower memory machines.

As such, I believe we're the only bit of firmware/bootloader ever that has correctly parsed a journalling filesystem.

-- Stewart Smith
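The device-mapper trick Stewart describes can be sketched roughly like this (device and mount paths are placeholders, and the real petitboot implementation surely differs in detail):

```shell
# Make the original device reject all writes, then mount via a
# non-persistent dm snapshot whose COW store absorbs journal replay.
blockdev --setro /dev/sdX1
dd if=/dev/zero of=/tmp/cow.img bs=1M count=256        # COW store for replay writes
cow=$(losetup --find --show /tmp/cow.img)
size=$(blockdev --getsz /dev/sdX1)                     # device size in 512B sectors
# snapshot table: <start> <len> snapshot <origin> <cow> <N = non-persistent> <chunksize>
dmsetup create snap --table "0 $size snapshot /dev/sdX1 $cow N 8"
mount /dev/mapper/snap /mnt    # journal replay lands in the COW, not on sdX1
```

Tearing it down (umount, dmsetup remove, losetup -d) leaves the original device byte-for-byte untouched.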
Re: Loss of connection to Half of the drives
On Wed, Dec 23, 2015 at 12:20 PM, Goffredo Baroncelli wrote:
> On 2015-12-23 16:53, Donald Pearson wrote:
> [...]
>>
>> Additionally real Raid10 will run circles around what BTRFS is doing
>> in terms of performance. In the 20 drive array you're striping across
>> 10 drives, in BTRFS right now you're striping across 2 no matter what.
>> So not only do I lose in terms of resilience I lose in terms of
>> performance. I assume that N-way-mirroring used with BTRFS Raid10
>> will also increase the stripe width so that will level out the
>> performance but you're always going to be short a drive for equal
>> resilience.
>
> In case of RAID10, to the best of my knowledge, BTRFS allocates each CHUNK
> across *all* the available devices. It uses the usual RAID0 (== striping)
> over a RAID1 (mirroring).
>
> What you are describing is the BTRFS RAID1, i.e. LINEAR over a RAID1: each
> chunk is allocated on *two*, only *two* different disks from the disk pool;
> the disks are the ones with the largest free space. Each chunk may be
> allocated on a different *pair* of disks.

Okay, so however the chunk is divided up, 2 copies of each chunk division are written somewhere. So I misunderstood, thanks for clearing it up!

>> And finally the elephant in the room that comes with the necessary
>> 11-way mirroring is the usable capacity of that 20 drive array.
>> Remember, pea brain, so my math may be wrong in application and
>> calculation, but if it's made of 1T drives for 20T raw, there is only
>> 1.82T usable (20 / 11), and if I'm completely off in that figure the
>> point is still that such a high level of mirroring is going to
>> excessively consume drive space.
>
> Duncan talked about N-way mirroring where each disk contains a copy of
> the same data. Nobody talked about N-way mirroring where N is less than
> the number of the available disks.

Well that was certainly implied as the unimplemented solution to dropping half the drives that the OP tested. 
N-way mirroring where N = the number of drives is just Raid1 on crack and not the Raid10 use-case that the OP is asking about.

> To be honest, in the past some patches appeared to implement a generalized
> RAID-NxM, where N is the total number of disks and M the number of
> redundancy disks: i.e. the filesystem could allow a drop of M disks (see
> http://www.spinics.net/lists/linux-btrfs/msg29245.html).
>
> BR
> G.Baroncelli
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

Yeah that whole thing is pretty upsetting.
Re: Ideas to do custom operation just after mount?
On Mon, Dec 21, 2015 at 01:18:22PM +0800, Anand Jain wrote:
> > BTW, any good idea for btrfs to do such operation like
> > enabling/disabling some minor features? Especially when it can be set on
> > individual file/dirs.
> >
> > Features like incoming write time deduplication, is designed to be
> > enabled/disabled for individual file/dirs, so it's not a quite good idea
> > to use mount option to do it.
> >
> > Although some feature, like btrfs quota (qgroup), should be implemented
> > by mount option though.
> > I don't understand why qgroup is enabled/disabled by ioctl. :(
>
> mount option won't persist across systems/computers unless
> remembered by human.

So record the mount option you want persistent in the filesystem at mount time and don't turn it off until a "no-" mount option is provided at mount time.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: btrfs und lvm-cache?
On Wed, Dec 23, 2015 at 1:24 PM, Neuer User wrote:
> One other thing:
>
> I read that btrfs has some options that are turned off for SSDs as they
> might be harmful or so. In my case btrfs, however, would not know about
> the SSD and probably use its HDD optimized settings. The result,
> however, would be forwarded also to the SSD via lvmcache. Do I see that
> right? Would that give any serious problems?

No, Btrfs is fine for SSDs with or without the optimization, and with the optimization it's OK for hard drives also. I think you're unlikely to notice any difference, but you can test it if you want with the mount options ssd or nossd, depending on how the cache LV is detected (I'd guess it's detected as non-rotational, so the ssd option is the default).

-- Chris Murphy
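To make that explicit (device and mount paths are placeholders; the dm-N name depends on your LVM layout):

```shell
# What did the kernel auto-detect for the cached LV? (0 = non-rotational,
# which makes btrfs default to its ssd heuristics)
cat /sys/block/dm-0/queue/rotational

# Force either behavior regardless of detection:
mount -o ssd   /dev/vg0/hdd1 /mnt    # SSD heuristics
mount -o nossd /dev/vg0/hdd1 /mnt    # rotational-disk heuristics
```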
Raid 5/6 Stability
Hello dev crew,

Just an avid follower of the project checking in. It has been about nine months since the initial Raid 5/6 features were released in 3.19 and they are still listed as incomplete/experimental on the Wiki.

Admittedly, I don't understand how such a large and distributed project prioritizes features for development, but I haven't been able to find a clear roadmap anywhere.

I'm wondering if anyone here is able to give me some insight about when the Raid 5/6 feature will next be updated, or even when they are scheduled to lose their incomplete/experimental designation.

Thanks!
Re: btrfs und lvm-cache?
On Wed, Dec 23, 2015 at 6:38 AM, Neuer User wrote:
> On 23.12.2015 at 12:21, Martin Steigerwald wrote:
>> Hi.
>>
>> As far as I understand, this way you basically lose the RAID 1 semantics
>> of BTRFS. While the data is redundant on the HDDs, it is not redundant on
>> the SSD. It may work for a pure read cache, but for write-through you
>> definitely lose any data integrity protection a RAID 1 gives you.
>>
> Hmm, are you sure? I thought LVM lies underneath btrfs. Btrfs thus
> should not know about the caching SSD at all. It only knows of the two
> LVs on the HDDs, reading and writing data from or to one or both of the
> two LVs.

I believe Martin's concern is two-fold:

The first, major issue concerns the default writeback cache mode, which makes the SSD a single point of failure. (In writeback mode, a write to a block that is cached goes only to the cache, and the block is marked dirty in the metadata.) If the SSD fails with dirty data in the cache which has not been flushed to the backing devices, the filesystem may be in an unrecoverable state, because writes which BTRFS was told had succeeded are not present on disk.

The second potential issue is that if the SSD performs internal deduplication, the two copies of cached data (contents of drive 1, contents of drive 2) may actually be references to the same bits of internal storage, meaning a single corruption will affect both cached copies. If in writeback, then corrupted data could flush down to both disks. I'm not sure what would happen in writethrough.

~ Noah
Re: btrfs und lvm-cache?
On 23.12.2015 at 20:45, Noah Massey wrote:
> On Wed, Dec 23, 2015 at 6:38 AM, Neuer User wrote:
> I believe Martin's concern is two-fold:
>
> The first, major issue, concerns the default writeback cache mode,
> which makes the SSD a single point of failure. (In writeback mode, a
> write to a block that is cached will go only to the cache and the block
> will be marked dirty in the metadata.) If the SSD fails with dirty
> data in the cache which has not been flushed to the backing devices,
> the filesystem may be in an unrecoverable state, because writes which
> BTRFS was told had succeeded are not present on disk.

Ok, I see. Would it help if the cache were set to writethrough then? In that case the data on the HDDs should always be ok, right? (At least as long as the HDDs are fine.)

> The second potential issue is that if the SSD performs internal
> deduplication, the two copies of cached data (contents on drive 1,
> content on drive 2) may actually be a reference to the same bits of
> internal storage, meaning a single corruption will affect both cached
> copies. If in writeback, then corrupted data could flush down to both
> disks. I'm not sure what would happen in writethrough.

Understood. However, do SSDs really do automatic deduplication? I might be completely wrong here, but that sounds like a rather complex mechanism, requiring lots of RAM to deduplicate 100 GB. I wouldn't have thought that typical SSDs include that?

> ~ Noah
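For what it's worth, lvmcache(7) documents changing the cache mode of an existing cached LV; something like the following should do it (VG and LV names are placeholders):

```shell
# Writethrough: a write completes only after hitting both the cache
# and the origin (HDD) LV, so the HDDs always hold a consistent copy
# and a dead SSD costs performance, not data.
lvconvert --cachemode writethrough vg0/hdd1
lvconvert --cachemode writethrough vg0/hdd2
```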
Re: btrfs und lvm-cache?
On 23.12.2015 at 20:49, Chris Murphy wrote:
> Seems to me if the LVs on the two HDDs are exposed, the lvmcache has
> to separately keep track of those LVs. So as long as everything is
> working correctly, it should be fine. That includes either transient
> or persistent, but consistent, errors for either HDD or the SSD, and
> Btrfs can fix up those bad reads with data from the other. If the SSD
> were to decide to go nutty, chances are reads through lvmcache would
> be corrupt no matter what LV is being read by Btrfs, and it'll be
> aware of that and discard those reads. Any corrupt writes in this
> case won't be immediately known by Btrfs because it (like any file
> system) assumes writes are OK unless the device reports a write
> failure, but those too would be found on read.

What corrupt write do you mean? The "nuts" SSD is not going to write to the HDDs; that will be done by lvmcache. So the HDDs should get the correct data, only the SSD will be bad, right? And that would become obvious with the next reads, in which case btrfs would probably throw an error as it gets crazy data from apparently both LVs (but really only coming from the SSD). So that could be fixed by removing the SSD, without any data loss from the HDDs, right?

> The question I have, that I don't know the answer to, is if the stack
> arrives at a point where all writes are corrupt but hardware isn't
> reporting write errors, and it continues to happen for a while, once
> you've resolved that problem and try to mount the file system again,
> how well does Btrfs disregard all those bad writes? How well would any
> filesystem?

Hmm, again, the writes to the HDDs should be ok. Only the SSD would have pretty corrupt data, right? In such a case it might depend on how much bad data is read back from the SSD and what the filesystem does in reaction to it.

P.S.: Of course, one other possibility would be to use a second SSD, so that each LV has a separate caching SSD. 
In this case, there would always be a valid source (given that not both SSDs go nuts at the same time...). But I would need another slot for this. If the advantages are big enough, that's ok. If it works nicely with just one SSD, then even better.
btrfs und lvm-cache?
Hello

I want to set up a small homeserver, based on a HP Microserver Gen8 (4GB RAM, 2x3TB HDD + 1x120GB SSD) and Proxmox as distro.

The server will be used to host a (small) number of virtual machines, most of them being LXC containers, a few being KVM machines. One of the LXC containers will host a fileserver with approx. 1 TB of data and another one a backup system for the desktops / laptops in my household, thus probably holding quite a lot of files. The LXC containers will use the filesystem of the Proxmox host, the KVM machines probably raw disk files (or qcow2).

I would like to combine high data integrity with some speed, so I thought of the following layout:

- both HDDs and the SSD in one LVM VG
- one LV on each HDD, containing a btrfs filesystem
- both btrfs LVs configured as RAID1
- the single SSD used as an LVM cache device for both HDD LVs, to speed up random access where possible

Now, I wonder if that is a good architecture to go for. Any input on that? Is btrfs the right way to go, or should I better go for ZFS (and purchase some more gigs of RAM)?

Will there be any problems arising from lvmcache? btrfs only sees the HDDs; LVM does the SSD handling.

Thanks for any input. I like btrfs very much, but data integrity is important for this.

Michael
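For reference, the layout described above could be sketched roughly as follows, based on a reading of lvmcache(7); device, VG, and LV names are placeholders and the cache-pool sizes are illustrative, so treat this as a sketch rather than a tested recipe:

```shell
# 2x HDD (sda, sdb) + 1x SSD (sdc) in one VG
pvcreate /dev/sda /dev/sdb /dev/sdc
vgcreate vg0 /dev/sda /dev/sdb /dev/sdc

# One LV per HDD, pinned to that PV so each LV maps to one physical disk
lvcreate -n hdd1 -l 100%PVS vg0 /dev/sda
lvcreate -n hdd2 -l 100%PVS vg0 /dev/sdb

# Split the single SSD into two cache pools, one per HDD LV
lvcreate --type cache-pool -n cpool1 -L 55G vg0 /dev/sdc
lvcreate --type cache-pool -n cpool2 -L 55G vg0 /dev/sdc
lvconvert --type cache --cachepool vg0/cpool1 vg0/hdd1
lvconvert --type cache --cachepool vg0/cpool2 vg0/hdd2

# btrfs RAID1 (data and metadata) across the two now-cached LVs
mkfs.btrfs -d raid1 -m raid1 /dev/vg0/hdd1 /dev/vg0/hdd2
```

Note that, per the concerns raised elsewhere in this thread, the single SSD remains a shared failure point for both cache pools unless the cache mode is writethrough.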