> ... just rearrange your blocks sensibly -
> and to at least some degree you could do that while
> they're still cache-resident
Lots of discussion has passed under the bridge since that observation above, but it may have contained the core of a virtually free solution: let your table become fragmented, but each time a sequential scan is performed on it, determine whether the region you're currently scanning is *sufficiently* fragmented to be worth fixing. If it is, retain the blocks that you've just had to access anyway in cache until you've built up around 1 MB of them, and then (in a background thread) flush the result contiguously back to a new location in a single bulk 'update' that changes only their location rather than their contents. (A rough sketch of the per-region check is appended at the end of this message.)

1. You don't incur any extra reads, since you were reading sequentially anyway and already have the relevant blocks in cache. Yes, if you had reorganized earlier in the background the current scan would have gone faster, but if scans occur frequently enough for their performance to be a significant issue, then the *previous* scan will probably not have left things *all* that fragmented. This is why you choose a fragmentation threshold to trigger reorg rather than doing it whenever there's any fragmentation at all: the latter would probably not be cost-effective in some circumstances. Conversely, if you only perform sequential scans once in a blue moon, every one may encounter complete fragmentation, but it wouldn't have been worth defragmenting constantly in the background to avoid this, and the occasional reorg triggered by the rare scan won't add enough overhead to justify heroic efforts to avoid it. Such a 'threshold' is a crude but possibly adequate metric. A better but more complex one would perhaps nudge the threshold up every time a sequential scan took place without an intervening update, so that rarely-updated but frequently-scanned files would eventually approach full contiguity; an even finer-grained metric would maintain such information about each individual *region* in a file. But absent evidence that the single, crude, unchanging threshold (probably set to defragment moderately aggressively - e.g., whenever it takes more than 3 or 5 disk seeks to inhale a 1 MB region) is inadequate, these sound a bit like overkill.

2. You don't defragment data that's never sequentially scanned, avoiding unnecessary system activity and snapshot space consumption.

3. You still incur additional snapshot overhead, for data that you do decide to defragment, for each block that hadn't already been modified since the most recent snapshot. But performing the local reorg as a batch operation means that only a single copy of all affected ancestor blocks winds up in the snapshot due to the reorg (rather than potentially multiple copies in multiple snapshots, if snapshots were frequent and movement was performed one block at a time).

- bill
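P.S. To make the decision logic concrete, here is a rough sketch in C. Everything in it is hypothetical - the type names, the block and region sizes, and the queue_contiguous_rewrite() stand-in are illustrations only, not ZFS code or any existing API. It just shows the idea of counting seeks within a ~1 MB region that the scan has already cached and queueing the region for a background contiguous rewrite when the count crosses the threshold.

/*
 * Sketch only: hypothetical names and sizes throughout.  A sequential
 * scan already has a ~1 MB region's blocks in cache; count how many
 * seeks it took to read them, and if that exceeds a modest threshold,
 * hand the cached blocks to a background thread that rewrites them
 * contiguously, changing only their location, not their contents.
 */
#include <stdint.h>
#include <stdio.h>

#define REGION_BYTES      (1024 * 1024)         /* reorg granularity  */
#define BLOCK_BYTES       (128 * 1024)          /* example block size */
#define BLOCKS_PER_REGION (REGION_BYTES / BLOCK_BYTES)
#define SEEK_THRESHOLD    4                     /* "3 or 5 seeks"     */

typedef struct {
    uint64_t dva;           /* on-disk address of one cached block */
} scanned_block_t;

/* Every block that doesn't start right after its predecessor on disk
 * cost the scan one extra seek. */
static int
region_seek_count(const scanned_block_t *blk, int nblocks)
{
    int seeks = 1;                      /* first block always seeks */
    for (int i = 1; i < nblocks; i++) {
        if (blk[i].dva != blk[i - 1].dva + BLOCK_BYTES)
            seeks++;
    }
    return seeks;
}

/* Stand-in for the background work: a real system would allocate a
 * fresh contiguous extent, write the cached blocks there, and update
 * only their block pointers. */
static void
queue_contiguous_rewrite(const scanned_block_t *blk, int nblocks)
{
    (void)blk;
    printf("  queued %d cached blocks for contiguous rewrite\n", nblocks);
}

/* Called as the sequential scan finishes each ~1 MB region. */
static void
maybe_reorg_region(const scanned_block_t *blk, int nblocks)
{
    int seeks = region_seek_count(blk, nblocks);

    if (seeks > SEEK_THRESHOLD)
        queue_contiguous_rewrite(blk, nblocks);
    else
        printf("  %d seek(s) is fine; leaving region alone\n", seeks);
}

int
main(void)
{
    scanned_block_t frag[BLOCKS_PER_REGION];
    scanned_block_t contig[BLOCKS_PER_REGION];

    /* A badly fragmented region: every block lives somewhere else. */
    for (int i = 0; i < BLOCKS_PER_REGION; i++)
        frag[i].dva = (uint64_t)i * 10 * BLOCK_BYTES;

    /* A contiguous region: blocks are back to back on disk. */
    for (int i = 0; i < BLOCKS_PER_REGION; i++)
        contig[i].dva = (1ULL << 30) + (uint64_t)i * BLOCK_BYTES;

    printf("fragmented region:\n");
    maybe_reorg_region(frag, BLOCKS_PER_REGION);
    printf("contiguous region:\n");
    maybe_reorg_region(contig, BLOCKS_PER_REGION);
    return 0;
}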