> This question triggered some silly questions in my mind:

Actually, they're not silly at all.

> 
> Lots of folks are determined that the whole COW to different locations
> is a Bad Thing(tm), and in some cases, I guess it might actually be...
> 
> What if ZFS had a pool / filesystem property that caused zfs to do a
> journaled, but non-COW, update so the data's relative location for
> databases is always the same?

That's just what a conventional file system does when it's not guaranteeing 
write atomicity (you address the latter below): when you're updating in place 
without that guarantee, you don't even need a journal.
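
To make that concrete, here's a minimal sketch of what "updating in place" 
means at the syscall level (the block size and function name are made up for 
illustration):

    /* Minimal sketch of a plain in-place update: overwrite one block of an
     * existing file at a fixed offset.  Nothing here is atomic - if the
     * machine dies mid-pwrite(), the block can be left half old, half new. */
    #include <fcntl.h>
    #include <unistd.h>

    #define BLKSZ 8192   /* hypothetical database page size */

    int update_in_place(int fd, off_t blkno, const void *newdata)
    {
        /* The block's physical location never changes, so a later
         * sequential scan still sees the file laid out contiguously. */
        if (pwrite(fd, newdata, BLKSZ, blkno * BLKSZ) != BLKSZ)
            return -1;
        return fsync(fd);
    }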

> 
> Or - what if it did a double update: one to a staged area, and another
> immediately after that to the 'old' data blocks. Still always have
> on-disk consistency etc., at a cost of double the I/Os...

It only requires an extra disk access if the new data is too large to dump 
right into the journal itself (which guarantees that the subsequent in-place 
update can complete).  Whether the new data is dumped into the log or into a 
temporary location whose address is logged, the subsequent in-place 
update can be deferred until it's convenient (e.g., until after any additional 
updates to the same data have also been accumulated, activity has cooled off, 
and the modified blocks are getting ready to be evicted from the system cache - 
and, optionally, until the target disks are idle or have their heads positioned 
conveniently near the target location).
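
Roughly, the two-step protocol looks like this (the record layout and all 
names are invented for illustration - a generic write-ahead journal, not 
ZFS's actual format):

    #include <unistd.h>

    #define BLKSZ 8192

    struct jrec {                /* toy journal record header */
        off_t  target_offset;    /* where the data will eventually land */
        size_t len;
    };

    /* Step 1: persist the new data (or, for large writes, a pointer to a
     * staged copy) in the journal.  Once fsync() returns, the update is
     * durable even though the home location is still stale. */
    int journal_write(int jfd, off_t target, const void *data, size_t len)
    {
        struct jrec hdr = { .target_offset = target, .len = len };
        if (write(jfd, &hdr, sizeof hdr) != sizeof hdr) return -1;
        if (write(jfd, data, len) != (ssize_t)len)      return -1;
        return fsync(jfd);
    }

    /* Step 2: the deferred in-place write-back, done whenever it's
     * convenient - after further updates to the same block have been
     * absorbed, or when the disk heads happen to be nearby. */
    int writeback(int datafd, off_t target, const void *data, size_t len)
    {
        if (pwrite(datafd, data, len, target) != (ssize_t)len) return -1;
        return fsync(datafd);  /* the journal record can now be reclaimed */
    }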

ZFS's small-synchronous-write log can do something similar as long as the 
writes aren't too large to place in it.  However, data that's only persisted in 
the journal isn't accessible via the normal snapshot mechanisms (well, if an 
entire file block was dumped into the journal I guess it could be, at the cost 
of some additional complexity in journal space reuse), so I'm guessing that ZFS 
writes back any dirty data that's in the small-update journal whenever a 
snapshot is created.

And if you start actually updating in place as described above, then you can't 
use ZFS-style snapshotting at all:  instead of capturing the current state as 
the snapshot with the knowledge that any subsequent updates will not disturb 
it, you have to capture the old state that you're about to overwrite and stuff 
it somewhere else - and then figure out how to maintain appropriate access to 
it while the rest of the system moves on.
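
In other words, something like the copy-on-first-write scheme that block-level 
snapshot implementations use.  A schematic sketch (in-memory arrays standing in 
for disk, and every name here is invented):

    #include <stdbool.h>
    #include <string.h>

    #define BLKSZ   8192
    #define NBLOCKS 1024

    static char disk[NBLOCKS][BLKSZ];   /* the live, update-in-place blocks */
    static char saved[NBLOCKS][BLKSZ];  /* old contents kept for the snapshot */
    static bool copied[NBLOCKS];        /* saved since the snapshot was taken? */

    void write_block(int blkno, const void *newdata)
    {
        if (!copied[blkno]) {
            /* First write since the snapshot: capture the old state before
             * it's destroyed.  This extra copy (and its bookkeeping) is
             * what COW gets for free. */
            memcpy(saved[blkno], disk[blkno], BLKSZ);
            copied[blkno] = true;
        }
        memcpy(disk[blkno], newdata, BLKSZ);   /* then update in place */
    }

    /* Reading the snapshot means consulting the saved copy where one exists. */
    const void *read_snapshot_block(int blkno)
    {
        return copied[blkno] ? saved[blkno] : disk[blkno];
    }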

Snapshots make life a lot more complex for file systems than it used to be, and 
COW techniques make snapshotting easy at the expense of normal run-time 
performance - not just because they make update-in-place infeasible for 
preserving on-disk contiguity but because of the significant increase in disk 
bandwidth (and snapshot storage space) required to write back changes all the 
way up to whatever root structure is applicable:  I suspect that ZFS does this 
on every synchronous update save for those that it can leave temporarily in its 
small-update journal, and it *has* to do it whenever a snapshot is created.
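
To see where that bandwidth goes, consider a toy COW tree (the structures are 
invented, not ZFS's): rewriting one leaf relocates it, which changes the 
pointer in its parent, which relocates the parent, and so on up to the root.

    #include <stdlib.h>

    struct node {
        struct node *parent;
        struct node *child;    /* stands in for a block pointer */
    };

    /* COW-modify a leaf: every level on the path to the root gets a new
     * copy at a new location, because its pointer to the level below has
     * changed.  Returns the new root; the untouched old tree *is* the
     * snapshot.  (Error handling elided.) */
    struct node *cow_update(struct node *leaf, int *blocks_written)
    {
        struct node *below = NULL;
        *blocks_written = 0;
        for (struct node *p = leaf; p != NULL; p = p->parent) {
            struct node *copy = malloc(sizeof *copy);
            *copy = *p;            /* this level lands at a brand-new location */
            copy->child = below;   /* point at the fresh copy one level down */
            if (below != NULL)
                below->parent = copy;
            below = copy;
            ++*blocks_written;     /* one more block of bandwidth and snapshot space */
        }
        return below;
    }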

> 
> Of course, both of these would require non-sparse file creation for
> the DB etc., but would it be plausible?

Update-in-place files can still be sparse:  it's only data that already exists 
that must be present (and updated in place to preserve sequential access 
performance to it).
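
For example (path and sizes made up), a mostly-empty file still supports 
in-place rewrites of the blocks that do exist:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int  fd = open("/tmp/sparse.db", O_CREAT | O_RDWR, 0644);
        char block[8192] = { 'x' };

        /* Write one block 1 GiB into the file: everything before it is a
         * hole and consumes no disk space. */
        pwrite(fd, block, sizeof block, 1024L * 1024 * 1024);

        /* Rewriting that block later is an update in place: on an
         * update-in-place file system it keeps its location, and the
         * hole stays a hole. */
        pwrite(fd, block, sizeof block, 1024L * 1024 * 1024);

        close(fd);
        return 0;
    }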

> 
> For very read-intensive and position-sensitive applications, I guess
> this sort of capability might make a difference?

No question about it.  And sequential table scans in databases are among the 
most significant examples, because the tables are also subject to 
fine-grained, often-random update activity.  That's unlike, say, streaming 
video files, which just get laid down initially and non-synchronously in a 
manner that at least potentially allows ZFS to accumulate them in large, 
contiguous chunks (though ISTR some discussion about just how well ZFS managed 
this when it was accommodating multiple such write streams in parallel).

Background defragmentation can help, though it generates a boatload of 
additional space overhead in any applicable snapshot.

- bill