On Sun, Jun 05, 2016 at 10:56:45PM +0200, Christoph Anton Mitterer wrote:
> On Sun, 2016-06-05 at 22:39 +0200, Henk Slager wrote:
> > > So the point I'm trying to make:
> > > People do probably not care so much whether their VM image/etc. is
> > > COWed or not, snapshots/etc. still work with that,... but they may
> > > likely care if the integrity feature is lost.
> > > So IMHO, nodatacow + checksumming deserves to be amongst the top
> > > priorities.
> > Have you tried blockdevice/HDD caching like bcache or dmcache in
> > combination with VMs on BTRFS?
> Not yet,... my personal use case is just some VMs on the notebook, and
> for this, the above would seem a bit overkill.
> For the larger VM cluster at the institute,... puh to be honest I don't
> know by heart what we do there.
> 
> 
> >   Or ZVOL for VMs in ZFS with L2ARC?
> Well but all this is an alternative solution,...
> 
> 
> > I assume the primary reason for wanting nodatacow + checksumming is to
> > avoid long seek times on HDDs due to growing fragmentation of the VM
> > images over time.
> Well the primary reason is wanting to have overall checksumming in the
> fs, regardless of which features one uses.

   The problem is that you can't guarantee consistency with
nodatacow+checksums. If you have nodatacow, then data is overwritten,
in place. If you do that, then you can't have a fully consistent
checksum -- there are always race conditions between the checksum and
the data being written (or the data and the checksum, depending on
which way round you do it).
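
   To make the ordering problem concrete, here is a minimal toy sketch
in Python. It is not btrfs code: the Block class, the two update
functions and the "crash" flags are all made up for illustration, and
the CoW commit is simply assumed to be atomic.

```python
import zlib


class Block:
    """Toy model: one data block plus its separately stored checksum."""

    def __init__(self, data: bytes):
        self.data = data
        self.csum = zlib.crc32(data)

    def verify(self) -> bool:
        return zlib.crc32(self.data) == self.csum


def overwrite_in_place(block: Block, new: bytes, crash_between: bool = False) -> None:
    """nodatacow-style update: two separate writes to live locations."""
    block.data = new                 # write 1: data overwritten in place
    if crash_between:
        return                       # simulated power loss before write 2
    block.csum = zlib.crc32(new)     # write 2: checksum catches up


def overwrite_cow(block: Block, new: bytes, crash_before_commit: bool = False) -> None:
    """CoW-style update: stage elsewhere, then switch both in one commit."""
    staged = (new, zlib.crc32(new))  # written to free space, not yet visible
    if crash_before_commit:
        return                       # old data and old checksum still match
    block.data, block.csum = staged  # modelled as a single atomic commit


a = Block(b"old")
overwrite_in_place(a, b"new", crash_between=True)
print(a.verify())   # False: new data, stale checksum

b = Block(b"old")
overwrite_cow(b, b"new", crash_before_commit=True)
print(b.verify())   # True: nothing visible changed
```

   Swapping the two writes in overwrite_in_place only moves the
window: you then get a new checksum sitting over old data instead.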

> I think we already have some situations where tools use/set btrfs
> features by themselves (i.e. automatically)... wasn't systemd creating
> subvols per default in some locations, when there's btrfs?
> So it's no big step to postgresql/etc. setting nodatacow, making people
> lose integrity without them even knowing.
> 
> Of course, avoiding the fragmentation is the reason for the desire to
> have nodatacow.
> 
> 
> >  But even if you have nodatacow + checksumming
> > implemented, it is then still HDD access and a VM imagefile itself is
> > not guaranteed to be contiguous.
> Uhm... sure, but that's no difference to other filesystems?!
> 
> 
> > It is clear that for VM images the amount of extents will be large
> > over time (like 50k or so, autodefrag on),
> Wasn't it said that autodefrag performs badly for anything larger than
> ~1G?

   I don't recall ever seeing someone saying that. Of course, I may
have forgotten seeing it...

> >  but with a modern SSD used
> > as cache, it doesn't matter. It is still way faster than just HDD(s),
> > even with freshly copied image with <100 extents.
> Well the fragmentation has also many other consequences and not just
> seeks (assuming everyone would use SSDs, which isn't and probably won't be
> the case for quite a while).
> Most obviously you get much more IOPS and btrfs itself will, AFAIU,
> also suffer from some issues due to the fragmentation.

   This is a fundamental problem with all CoW filesystems. There are
some mitigations that can be put in place (true CoW rather than
btrfs's redirect-on-write, like some databases do, where the original
data is copied elsewhere before overwriting; cache aggressively and
with knowledge of the CoW nature of the FS, like ZFS does), but they
all have their drawbacks and pathological cases.
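
   The difference between the two write strategies can be sketched
with a toy model in Python. A file is just a list of physical block
numbers here; the function names and the bump allocator (next_free)
are assumptions for illustration and have nothing to do with the real
btrfs allocator.

```python
def count_extents(layout):
    """Number of runs of physically consecutive blocks in the live file."""
    return 1 + sum(1 for a, b in zip(layout, layout[1:]) if b != a + 1)


def redirect_on_write(layout, logical, next_free):
    """New data goes to a fresh block; the file now points there."""
    layout[logical] = next_free
    return next_free + 1


def copy_then_overwrite(layout, logical, next_free):
    """Old data is copied out to next_free, new data overwrites in place.

    The live file's layout is unchanged; the price is an extra copy
    of the old block on every write.
    """
    return next_free + 1


row = list(range(8))          # freshly copied image: one extent
free = 8
for logical in (3, 5):
    free = redirect_on_write(row, logical, free)
print(row, count_extents(row))   # [0, 1, 2, 8, 4, 9, 6, 7] 5

cow = list(range(8))
free = 8
for logical in (3, 5):
    free = copy_then_overwrite(cow, logical, free)
print(cow, count_extents(cow))   # [0, 1, 2, 3, 4, 5, 6, 7] 1
```

   In this model the copy-out variant keeps the live file in one
extent, but it pays for that with double writes, which is one of the
drawbacks mentioned above.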

   Hugo.

-- 
Hugo Mills             | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4          |                                        Harry Harrison
