> ... just rearrange your blocks sensibly -
> and to at least some degree you could do that while
> they're still cache-resident

Lots of discussion has passed under the bridge since that observation above,
but it may have contained the core of a virtually free solution:  let your
table become fragmented, but each time a sequential scan is performed on it,
determine whether the region you're currently scanning is *sufficiently*
fragmented to be worth reorganizing.  If it is, retain the sequential blocks
that you've just had to read anyway in cache until you've accumulated around
1 MB of them, then (in a background thread) flush the result contiguously
back to a new location in a single bulk 'update' that changes only their
location, not their contents.
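
To make that concrete, here's a minimal sketch of what the scan-side
bookkeeping might look like.  None of it is real ZFS code - every structure,
constant, and function below is invented for illustration - but it shows the
shape of the idea:  count the seeks that reading a 1 MB region cost you, and
if there were too many, hand the already-cached blocks to a background flush
that rewrites them contiguously (their location changes, their contents
don't).

/*
 * Hypothetical sketch only - not ZFS code.  A sequential scan notices how
 * scattered the blocks it has just read (and therefore cached) are,
 * accumulates roughly 1 MB of them, and hands the batch to a background
 * step that rewrites them contiguously.
 */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE     (128 * 1024)      /* hypothetical block size       */
#define REORG_BATCH    (1024 * 1024)     /* relocate ~1 MB at a time      */
#define SEEK_THRESHOLD 4                 /* "more than 3 or 5 seeks"      */

typedef struct cached_block {
    uint64_t logical_off;                /* offset within the file/table  */
    uint64_t physical_off;               /* where it currently sits on disk */
    /* the data itself stays in the cache; we only track identity here    */
} cached_block_t;

typedef struct reorg_batch {
    cached_block_t blocks[REORG_BATCH / BLOCK_SIZE];
    int            nblocks;
    int            seeks;                /* discontiguities seen so far   */
} reorg_batch_t;

/*
 * Background step: rewrite the whole batch to one contiguous run.  A real
 * implementation would issue one bulk write of the cached data and then
 * update only the block pointers (ordinary copy-on-write metadata).
 */
static void
reorg_flush(reorg_batch_t *b, uint64_t new_physical_off)
{
    for (int i = 0; i < b->nblocks; i++) {
        printf("relocate logical %llu: %llu -> %llu\n",
            (unsigned long long)b->blocks[i].logical_off,
            (unsigned long long)b->blocks[i].physical_off,
            (unsigned long long)(new_physical_off +
            (uint64_t)i * BLOCK_SIZE));
    }
    b->nblocks = 0;
    b->seeks = 0;
}

/*
 * Called for each block a sequential scan has just read into cache anyway.
 * Counts how many times the on-disk layout forced an extra seek and, once
 * ~1 MB has accumulated, decides whether relocating it is worth the bother.
 */
static void
scan_observe_block(reorg_batch_t *b, uint64_t logical_off,
    uint64_t physical_off, uint64_t *next_free_off)
{
    if (b->nblocks > 0) {
        cached_block_t *prev = &b->blocks[b->nblocks - 1];
        if (physical_off != prev->physical_off + BLOCK_SIZE)
            b->seeks++;                  /* non-contiguous: extra seek    */
    }

    b->blocks[b->nblocks].logical_off = logical_off;
    b->blocks[b->nblocks].physical_off = physical_off;
    b->nblocks++;

    if ((uint64_t)b->nblocks * BLOCK_SIZE >= REORG_BATCH) {
        if (b->seeks >= SEEK_THRESHOLD) {
            reorg_flush(b, *next_free_off);
            *next_free_off += REORG_BATCH;
        } else {
            b->nblocks = 0;              /* contiguous enough; drop batch */
            b->seeks = 0;
        }
    }
}

int
main(void)
{
    reorg_batch_t batch = { .nblocks = 0, .seeks = 0 };
    uint64_t next_free = 8ULL * 1024 * 1024 * 1024;  /* pretend free area */

    /* Simulate scanning 1 MB (eight blocks) scattered across the disk.   */
    for (int i = 0; i < 8; i++) {
        scan_observe_block(&batch, (uint64_t)i * BLOCK_SIZE,
            (uint64_t)(i * 7 + 3) * BLOCK_SIZE, &next_free);
    }
    return (0);
}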

1.  You don't incur any extra reads, since you were reading sequentially
anyway and already have the relevant blocks in cache.  Yes, if you had
reorganized earlier in the background the current scan would have gone
faster, but if scans occur frequently enough for their performance to be a
significant issue, then the *previous* scan will probably not have left
things *all* that fragmented.  This is why you choose a fragmentation
threshold to trigger reorg rather than doing it whenever there's any
fragmentation at all:  the latter would probably not be cost-effective in
some circumstances.  Conversely, if you only perform sequential scans once
in a blue moon, every one may find the data completely fragmented, but it
probably wouldn't have been worth defragmenting constantly in the background
to avoid this, and the occasional reorg triggered by the rare scan won't
constitute enough additional overhead to justify heroic efforts to avoid it.
Such a 'threshold' is a crude but possibly adequate metric.  A better but
more complex one would tighten the threshold a little every time a
sequential scan took place without an intervening update, so that
rarely-updated but frequently-scanned files would eventually approach full
contiguity, and an even finer-grained metric would maintain such information
about each individual *region* in a file.  But absent evidence that the
single, crude, unchanging threshold (probably set to defragment moderately
aggressively - e.g., whenever it takes more than 3 or 5 disk seeks to inhale
a 1 MB region) is inadequate, these sound a bit like overkill.  (A rough
sketch of both the simple threshold and the adaptive variant follows point 3
below.)

2.  You don't defragment data that's never sequentially scanned, avoiding 
unnecessary system activity and snapshot space consumption.

3.  You still incur additional snapshot overhead when you do decide to
defragment:  each block that hadn't already been modified since the most
recent snapshot must now be preserved in it.  But performing the local reorg
as a single batch operation means that only one copy of each affected
ancestor block winds up in the snapshot as a result of the reorg (rather
than potentially multiple copies in multiple snapshots if snapshots were
frequent and movement was performed one block at a time).
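
For what it's worth, here is an equally hypothetical sketch of the trigger
from point 1, together with the adaptive refinement:  a per-file seek budget
for a 1 MB region, lowered a notch each time a sequential scan completes
with no intervening update.  The direction is deliberate - fewer tolerated
seeks means more aggressive reorg, which is what lets a rarely-updated but
frequently-scanned file converge on full contiguity.  None of the names or
numbers below are real ZFS code; they're placeholders.

/* Hypothetical sketch only - the trigger policy from point 1. */
#include <stdbool.h>
#include <stdio.h>

typedef struct frag_policy {
    int  max_seeks_per_mb;    /* reorg when reading 1 MB needs more seeks */
    int  floor_seeks_per_mb;  /* never demand better than this (e.g. 0)   */
    bool updated_since_scan;  /* set by the write path                    */
} frag_policy_t;

/* Did reading this 1 MB region take enough seeks to justify a reorg? */
static bool
should_reorg(const frag_policy_t *p, int seeks)
{
    return (seeks > p->max_seeks_per_mb);
}

/* Call at the end of each sequential scan of the file. */
static void
scan_completed(frag_policy_t *p)
{
    if (!p->updated_since_scan &&
        p->max_seeks_per_mb > p->floor_seeks_per_mb)
        p->max_seeks_per_mb--;    /* no writes since last scan: get pickier */
    p->updated_since_scan = false;
}

/* Call from the file's update/write path. */
static void
file_updated(frag_policy_t *p)
{
    p->updated_since_scan = true;
}

int
main(void)
{
    frag_policy_t p = { .max_seeks_per_mb = 4, .floor_seeks_per_mb = 0,
        .updated_since_scan = false };

    printf("reorg a 6-seek region? %d\n", should_reorg(&p, 6));  /* 1 */
    scan_completed(&p);    /* no update since last scan: threshold drops to 3 */
    file_updated(&p);
    scan_completed(&p);    /* an update intervened: threshold stays at 3      */
    printf("threshold now %d\n", p.max_seeks_per_mb);
    return (0);
}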

- bill
 
 