> This question triggered some silly questions in my
> mind:

Actually, they're not silly at all.
> > Lots of folks are determined that the whole COW to
> > different locations are a Bad Thing(tm), and in some
> > cases, I guess it might actually be...
> >
> > What if ZFS had a pool / filesystem property that
> > caused zfs to do a journaled, but non-COW update so
> > the data's relative location for databases is always
> > the same?

That's just what a conventional file system does (no need even for a
journal, when you're updating in place) when it's not guaranteeing write
atomicity (you address the latter below).

> > Or - What if it did a double update: One to a staged
> > area, and another immediately after that to the 'old'
> > data blocks. Still always have on-disk consistency
> > etc, at a cost of double the I/O's...

It only requires an extra disk access if the new data is too large to
dump right into the journal itself (which guarantees that the subsequent
in-place update can complete). Whether the new data is dumped into the
log or into a temporary location the pointer to which is logged, the
subsequent in-place update can be deferred until it's convenient (e.g.,
until after any additional updates to the same data have also been
accumulated, activity has cooled off, and the modified blocks are
getting ready to be evicted from the system cache - and, optionally,
until the target disks are idle or have their heads positioned
conveniently near the target location).

ZFS's small-synchronous-write log can do something similar as long as
the writes aren't too large to place in it. However, data that's only
persisted in the journal isn't accessible via the normal snapshot
mechanisms (well, if an entire file block was dumped into the journal I
guess it could be, at the cost of some additional complexity in journal
space reuse), so I'm guessing that ZFS writes back any dirty data that's
in the small-update journal whenever a snapshot is created.
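To make the dump-into-journal vs. stage-and-log-a-pointer distinction
concrete, here's a rough Python sketch of the write path described
above. Every name in it (JOURNAL_SLOT_SIZE, Journal, StagingArea) is
invented for illustration - this is not ZFS's actual ZIL code, just the
general journaling technique:

```python
# Toy model of a journaled update-in-place write path.
# Hypothetical names throughout; not any real file system's API.

JOURNAL_SLOT_SIZE = 4096  # assumed max payload one journal record holds


class Journal:
    """Sequential log; a record is durable once append() returns."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)


class StagingArea:
    """Temporary on-disk area for payloads too big for the journal."""

    def __init__(self):
        self.blocks = {}
        self.next_loc = 0

    def store(self, data):
        loc = self.next_loc           # pretend disk location
        self.blocks[loc] = data
        self.next_loc += len(data)
        return loc


def write_update(journal, staging, offset, data):
    """Log an update so the later in-place write can always complete."""
    if len(data) <= JOURNAL_SLOT_SIZE:
        # Small write: dump the data itself into the journal.
        # No extra disk access beyond the (deferred) in-place update.
        journal.append(("data", offset, data))
    else:
        # Large write: stage the data elsewhere and log only a pointer.
        # This is the "double update" case - one extra I/O for staging.
        staged_at = staging.store(data)
        journal.append(("pointer", offset, staged_at))
    # The in-place update itself is deferred until convenient
    # (cache eviction, idle disks, heads near the target location).
```

Either way the journal record is what guarantees the in-place update can
be replayed after a crash; the extra I/O only appears on the large-write
path.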
And if you start actually updating in place as described above, then you
can't use ZFS-style snapshotting at all: instead of capturing the
current state as the snapshot with the knowledge that any subsequent
updates will not disturb it, you have to capture the old state that
you're about to overwrite and stuff it somewhere else - and then figure
out how to maintain appropriate access to it while the rest of the
system moves on.

Snapshots make life a lot more complex for file systems than it used to
be, and COW techniques make snapshotting easy at the expense of normal
run-time performance - not just because they make update-in-place
infeasible for preserving on-disk contiguity, but because of the
significant increase in disk bandwidth (and snapshot storage space)
required to write back changes all the way up to whatever root structure
is applicable. I suspect that ZFS does this on every synchronous update
save for those that it can leave temporarily in its small-update
journal, and it *has* to do it whenever a snapshot is created.

> > Of course, both of these would require non-sparse
> > file creation for the DB etc, but would it be
> > plausible?

Update-in-place files can still be sparse: it's only data that already
exists that must be present (and updated in place to preserve sequential
access performance to it).

> > For very read intensive and position sensitive
> > applications, I guess this sort of capability might
> > make a difference?

No question about it. And sequential table scans in databases are among
the most significant examples, because the tables are also subject to
fine-grained, often-random update activity - unlike things like
streaming video files, which just get laid down initially and
non-synchronously in a manner that at least potentially allows ZFS to
accumulate them in large, contiguous chunks (though ISTR some discussion
about just how well ZFS managed this when it was accommodating multiple
such write streams in parallel).
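The snapshot asymmetry above can be boiled down to a few lines. This is
a deliberately naive sketch (all class and attribute names are made up,
and real file systems track physical block trees, not dicts): with COW,
a snapshot just freezes the current state; with update-in-place, every
first overwrite after a snapshot must first copy the old contents aside:

```python
# Toy contrast: COW snapshot vs. update-in-place snapshot.
# Hypothetical names; blocks are modeled as a dict of lba -> data.

class CowFS:
    """COW: writes go to new locations, so a snapshot is a frozen view."""

    def __init__(self):
        self.blocks = {}
        self.snapshots = []

    def snapshot(self):
        # Capture current state; future writes won't disturb it.
        self.snapshots.append(dict(self.blocks))

    def write(self, lba, data):
        # Conceptually lands at a NEW physical location, leaving any
        # snapshot's view of the old data intact.
        self.blocks[lba] = data


class InPlaceFS:
    """Update-in-place: must save the OLD data before overwriting it."""

    def __init__(self):
        self.blocks = {}
        self.snapshot_saved = None  # old blocks preserved for a snapshot

    def snapshot(self):
        self.snapshot_saved = {}

    def write(self, lba, data):
        if self.snapshot_saved is not None and lba not in self.snapshot_saved:
            # Extra work on the first overwrite after a snapshot:
            # copy the old contents aside before clobbering them.
            self.snapshot_saved[lba] = self.blocks.get(lba)
        self.blocks[lba] = data
```

The point of the sketch is where the cost lands: the COW version pays
nothing extra at snapshot time or write time here (its cost shows up
elsewhere, in lost contiguity and metadata write-back), while the
in-place version pays an extra read-and-copy on every first overwrite
after a snapshot - which is exactly the complication described above.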
Background defragmentation can help, though it generates a boatload of
additional space overhead in any applicable snapshot.

- bill

This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss