> can you guess? wrote:
> >> For very read intensive and position sensitive applications, I
> >> guess this sort of capability might make a difference?
> >
> > No question about it.  And sequential table scans in databases are
> > among the most significant examples, because (unlike things like
> > streaming video files which just get laid down initially and
> > non-synchronously in a manner that at least potentially allows ZFS
> > to accumulate them in large, contiguous chunks - though ISTR some
> > discussion about just how well ZFS managed this when it was
> > accommodating multiple such write streams in parallel) the tables
> > are also subject to fine-grained, often-random update activity.
> >
> > Background defragmentation can help, though it generates a boatload
> > of additional space overhead in any applicable snapshot.
>
> The reason that this is hard to characterize is that there are really
> two very different configurations used to address different
> performance requirements: cheap and fast.  It seems that when most
> people first consider this problem, they do so from the cheap
> perspective: single disk view.  Anyone who strives for database
> performance will choose the fast perspective: stripes.
And anyone who *really* understands the situation will do both.

> Note: data redundancy isn't really an issue for this analysis, but
> consider it done in real life.  When you have a striped storage
> device under a file system, then the database or file system's view
> of contiguous data is not contiguous on the media.

The best solution is to make the data piece-wise contiguous on the
media at the appropriate granularity - which is largely determined by
disk access characteristics (the following assumes that the database
table is large enough to be spread across a lot of disks at moderately
coarse granularity, since otherwise it's often small enough to cache in
the generous amounts of RAM that are inexpensively available today).

A single chunk on an (S)ATA disk today (the analysis is similar for
high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size
to yield over 80% of the disk's maximum possible (fully-contiguous
layout) sequential streaming performance, after the overhead of an
'average' - 1/3 stroke - initial seek and partial rotation are figured
in (the latter could be avoided by using a chunk size that's an
integral multiple of the track size, but on today's zoned disks that's
a bit awkward).  A 1 MB chunk yields around 50% of the maximum
streaming performance.  ZFS's maximum 128 KB 'chunk size', if
effectively used as the disk chunk size as you seem to be suggesting,
yields only about 15% of the disk's maximum streaming performance
(leaving aside an additional degradation to a small fraction of even
that should you use RAID-Z).  And if you match the ZFS block size to a
16 KB database block size and use that as the effective unit of
distribution across the set of disks, you'll obtain a mighty 2% of the
potential streaming performance (again, we'll be charitable and ignore
the further degradation if RAID-Z is used).
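To make the arithmetic behind those percentages concrete, here's a
quick Python sketch of the model.  The seek time, rotational latency,
and streaming rate are assumed, era-typical (S)ATA figures - not
measurements from any particular drive - so treat the outputs as
illustrative, not authoritative:

```python
# Rough model: each chunk access pays one 'average' (1/3-stroke) seek
# plus half a rotation, then streams the chunk at the media rate.
# All three constants below are assumptions, not measured values.
AVG_SEEK_MS = 8.0     # assumed ~1/3-stroke seek
HALF_ROT_MS = 4.17    # half a rotation at 7,200 rpm
STREAM_MB_S = 60.0    # assumed sustained media transfer rate

def streaming_efficiency(chunk_kb: float) -> float:
    """Fraction of the disk's full streaming rate delivered when each
    access pays seek + rotational overhead before transferring one
    chunk of the given size."""
    transfer_ms = chunk_kb / 1024 / STREAM_MB_S * 1000
    return transfer_ms / (AVG_SEEK_MS + HALF_ROT_MS + transfer_ms)

for kb in (4096, 1024, 128, 16):
    print(f"{kb:>5} KB chunk: {streaming_efficiency(kb):5.1%}")
```

With these assumed figures the model lands close to the percentages
quoted above: roughly 85% at 4 MB, just under 60% at 1 MB, about 15%
at 128 KB, and around 2% at 16 KB.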
Now, if your system is doing nothing else but sequentially scanning
this one database table, this may not be so bad: you get truly awful
disk utilization (2% of its potential in the last case, ignoring
RAID-Z), but you can still read ahead through the entire disk set and
obtain decent sequential scanning performance by reading from all the
disks in parallel.  But if your database table scan is only one small
part of a workload which is (perhaps the worst case) performing many
other such scans in parallel, your overall system throughput will be
only around 4% of what it could be had you used 1 MB chunks (and the
individual scan performances will also suck commensurately, of course).

Using 1 MB chunks still spreads out your database admirably for
parallel random-access throughput: even if the table is only 1 GB in
size (eminently cachable in RAM, should that be preferable), that'll
spread it out across 1,000 disks (2,000, if you mirror it and
load-balance to spread out the accesses); and much smaller database
tables, if they're accessed sufficiently heavily for throughput to be
an issue, will be wholly cache-resident.

Or another way to look at it is in terms of how many disks you have in
your system: if it's less than the number of MB in your table size,
then the table will be spread across all of them regardless of what
chunk size is used, so you might as well use one that's large enough to
give you decent sequential scanning performance (and if your table is
too small to spread across all the disks, then it may well all wind up
in cache anyway).
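The spread-vs-utilization trade-off can be sketched the same way
(again using assumed disk figures; `disks_spanned` and
`per_disk_utilization` are just illustrative helpers, not anything in
ZFS itself):

```python
# Assumed disk characteristics (illustrative, as before).
STREAM_MB_S = 60.0        # assumed sustained media transfer rate
SEEK_PLUS_ROT_MS = 12.2   # assumed avg seek + half rotation

def disks_spanned(table_mb: int, chunk_mb: int, n_disks: int) -> int:
    """Chunks dealt round-robin touch at most one disk per chunk,
    capped by the number of disks available."""
    return min(n_disks, max(1, table_mb // chunk_mb))

def per_disk_utilization(chunk_mb: float) -> float:
    """Fraction of streaming rate each disk delivers under competing
    scans, when every chunk access pays full positioning overhead."""
    transfer_ms = chunk_mb / STREAM_MB_S * 1000
    return transfer_ms / (SEEK_PLUS_ROT_MS + transfer_ms)

# A 1 GB table in 1 MB chunks still lands on 1,000 disks:
print(disks_spanned(1024, 1, 1000))  # 1000

# ...while each disk runs well over half its streaming rate, versus a
# few percent with 16 KB distribution units - a gap of roughly 25x in
# aggregate throughput under this model:
print(per_disk_utilization(1.0) / per_disk_utilization(16 / 1024))
```

So under these assumptions the large chunk size costs essentially
nothing in parallelism for any table bigger than about a gigabyte,
while multiplying per-disk throughput many times over.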
ZFS's problem (well, the one specific to this issue, anyway) is that it
tries to use its 'block size' to cover two different needs: performance
for moderately fine-grained updates (though its need to propagate those
updates upward to the root of the applicable tree significantly
compromises this effort), and decent disk utilization (I'm using that
term to describe throughput as a fraction of potential streaming
throughput: just 'keeping the disks saturated' only describes where the
system hits its throughput wall, not how well its design does in
pushing that wall back as far as possible).  The two requirements
conflict, and in ZFS's case the latter one loses - badly.

Which is why background defragmentation could help, as I previously
noted: it could rearrange the table such that multiple
virtually-sequential ZFS blocks were placed contiguously on each disk
(to reach 1 MB total, in the current example) without affecting the ZFS
block size per se.  But every block so rearranged (and every tree
ancestor of each such block) would then leave an equal-sized residue in
the most recent snapshot if one existed, which gets expensive fast in
terms of snapshot space overhead (which then is proportional to the
amount of reorganization performed as well as to the amount of actual
data updating).

- bill

This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss