On Sun, Jun 05, 2016 at 10:56:45PM +0200, Christoph Anton Mitterer wrote:
> On Sun, 2016-06-05 at 22:39 +0200, Henk Slager wrote:
> > > So the point I'm trying to make:
> > > People do probably not care so much whether their VM image/etc. is
> > > COWed or not, snapshots/etc. still work with that,... but they may
> > > likely care if the integrity feature is lost.
> > > So IMHO, nodatacow + checksumming deserves to be amongst the top
> > > priorities.
> > Have you tried blockdevice/HDD caching like bcache or dmcache in
> > combination with VMs on BTRFS?
> No yet,... my personal use case is just some VMs on the notebook, and
> for this, the above would seem a bit overkill.
> For the larger VM cluster at the institute,... puh to be honest I don't
> know by hard what we do there.
>
> > Or ZVOL for VMs in ZFS with L2ARC?
> Well but all this is an alternative solution,...
>
> > I assume the primary reason for wanting nodatacow + checksumming is to
> > avoid long seektimes on HDDs due to growing fragmentation of the VM
> > images over time.
> Well the primary reason is wanting to have overall checksumming in the
> fs, regardless of which features one uses.

The problem is that you can't guarantee consistency with
nodatacow+checksums. If you have nodatacow, then data is overwritten,
in place. If you do that, then you can't have a fully consistent
checksum -- there are always race conditions between the checksum and
the data being written (or the data and the checksum, depending on
which way round you do it).

> I think we already have some situations where tools use/set btrfs
> features by themselves (i.e. automatically)... wasn't systemd creating
> subvols per default in some locations, when there's btrfs?
> So it's no big step to postgresql/etc. setting nodatacow, making people
> loose integrity without them even knowing.
>
> Of course, avoiding the fragmentation is the reason for the desire to
> have nodatacow.
>
> > But even if you have nodatacow + checksumming
> > implemented, it is then still HDD access and a VM imagefile itself is
> > not guaranteed to be continuous.
> Uhm... sure, but that's no difference to other filesystems?!
>
> > It is clear that for VM images the amount of extents will be large
> > over time (like 50k or so, autodefrag on),
> Wasn't it said, that autodefrag performs bad for anything larger than
> ~1G?

I don't recall ever seeing someone saying that. Of course, I may
have forgotten seeing it...

> > but with a modern SSD used
> > as cache, it doesn't matter. It is still way faster than just HDD(s),
> > even with freshly copied image with <100 extents.
> Well the fragmentation has also many other consequences and not just
> seeks (assuming everyone would use SSDs, which is and probably won't be
> the case for quite a while).
> Most obviously you get much more IOPS and btrfs itself will, AFAIU,
> also suffer from some issues due to the fragmentation.

This is a fundamental problem with all CoW filesystems.
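
As a rough illustration of why fragmentation follows from redirect-on-write,
here is a toy model, not btrfs's actual allocator. The names
(`count_extents`, `simulate`) and the bump allocator that always hands out
the next unused block are assumptions made for the sketch. A file starts as
one contiguous extent; each random overwrite either stays in place
(nodatacow-style) or is redirected to a freshly allocated block.

```python
import random


def count_extents(block_map):
    """Count runs of logically adjacent blocks that are also physically adjacent."""
    if not block_map:
        return 0
    extents = 1
    for prev, cur in zip(block_map, block_map[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def simulate(n_blocks, n_writes, redirect_on_write, seed=0):
    """Apply random single-block overwrites and return the resulting extent count."""
    rng = random.Random(seed)
    # Logical block i initially lives at physical block i: one extent.
    block_map = list(range(n_blocks))
    next_free = n_blocks  # hypothetical bump allocator: next unused physical block
    for _ in range(n_writes):
        logical = rng.randrange(n_blocks)
        if redirect_on_write:
            # New data goes to a new location; the mapping is updated.
            block_map[logical] = next_free
            next_free += 1
        # In-place overwrite: the logical-to-physical mapping does not change.
    return count_extents(block_map)


in_place = simulate(1000, 200, redirect_on_write=False)
redirected = simulate(1000, 200, redirect_on_write=True)
print("extents after in-place overwrites:", in_place)
print("extents after redirected overwrites:", redirected)
```

In this model the in-place file stays at a single extent however many writes
it takes, while every redirected write can split an existing extent. Real
allocators and workloads are far more complicated, so the numbers only show
the direction of the effect.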
There are some mitigations that can be put in place (true CoW rather
than btrfs's redirect-on-write, like some databases do, where the
original data is copied elsewhere before overwriting; cache
aggressively and with knowledge of the CoW nature of the FS, like ZFS
does), but they all have their drawbacks and pathological cases.

   Hugo.

-- 
Hugo Mills             | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4          |                                        Harry Harrison