> can you guess? wrote:
> >> For very read intensive and position sensitive applications, I guess
> >> this sort of capability might make a difference?
> >
> > No question about it.  And sequential table scans in databases are among
> > the most significant examples, because (unlike things like streaming video
> > files which just get laid down initially and non-synchronously in a manner
> > that at least potentially allows ZFS to accumulate them in large,
> > contiguous chunks - though ISTR some discussion about just how well ZFS
> > managed this when it was accommodating multiple such write streams in
> > parallel) the tables are also subject to fine-grained, often-random update
> > activity.
> >
> > Background defragmentation can help, though it generates a boatload of
> > additional space overhead in any applicable snapshot.
>
> The reason that this is hard to characterize is that there are really two
> very different configurations used to address different performance
> requirements: cheap and fast.  It seems that when most people first consider
> this problem, they do so from the cheap perspective: single disk view.
> Anyone who strives for database performance will choose the fast
> perspective: stripes.

And anyone who *really* understands the situation will do both.

> Note: data redundancy isn't really an issue for this analysis, but consider
> it done in real life.  When you have a striped storage device under a file
> system, then the database or file system's view of contiguous data is not
> contiguous on the media.

The best solution is to make the data piece-wise contiguous on the media at the 
appropriate granularity - which is largely determined by disk access 
characteristics (the following assumes that the database table is large enough 
to be spread across a lot of disks at moderately coarse granularity, since 
otherwise it's often small enough to cache in the generous amounts of RAM that 
are inexpensively available today).

A single chunk on an (S)ATA disk today (the analysis is similar for 
high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to yield 
over 80% of the disk's maximum possible (fully-contiguous layout) sequential 
streaming performance (after the overhead of an 'average' - 1/3 stroke - 
initial seek and partial rotation are figured in:  the latter could be avoided 
by using a chunk size that's an integral multiple of the track size, but on 
today's zoned disks that's a bit awkward).  A 1 MB chunk yields around 50% of 
the maximum streaming performance.  ZFS's maximum 128 KB 'chunk size', if 
effectively used as the disk chunk size as you seem to be suggesting, yields 
only about 15% of the disk's maximum streaming performance (leaving aside an 
additional degradation to a small fraction of even that should you use RAID-Z). 
And if you match the ZFS block size to a 16 KB database block size and use that 
as the effective unit of distribution across the set of disks, you'll obtain a 
mighty 2% of the potential streaming performance (again, we'll be charitable 
and ignore the further degradation if RAID-Z is used).
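
To make that arithmetic concrete, here's a rough back-of-the-envelope model in 
Python.  The drive parameters (an 8 ms 'average' 1/3-stroke seek, a 4.2 ms half 
rotation at 7,200 RPM, and a 70 MB/s sustained transfer rate) are assumptions 
typical of current (S)ATA drives rather than figures for any particular 
product, so the exact percentages shift a bit with whatever numbers you plug 
in:

  # Fraction of a disk's streaming bandwidth retained when every chunk read
  # pays an average seek plus half a rotation before transferring the chunk.
  # All parameters are assumed, illustrative values, not measurements.

  SEEK_MS = 8.0           # assumed 'average' (1/3-stroke) seek time
  HALF_ROTATION_MS = 4.2  # 7,200 RPM -> 8.33 ms/rev, half a rev on average
  STREAM_MB_PER_S = 70.0  # assumed sustained media transfer rate

  def utilization(chunk_kb):
      """Fraction of maximum streaming throughput achieved per chunk access."""
      transfer_ms = chunk_kb / 1024.0 / STREAM_MB_PER_S * 1000.0
      return transfer_ms / (SEEK_MS + HALF_ROTATION_MS + transfer_ms)

  for kb in (16, 128, 1024, 4096):
      print("%5d KB chunk: %4.1f%% of streaming bandwidth"
            % (kb, utilization(kb) * 100))

That prints roughly 2% for 16 KB, 13% for 128 KB, 54% for 1 MB, and 82% for 
4 MB - in the same ballpark as the figures above.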

Now, if your system is doing nothing else but sequentially scanning this one 
database table, this may not be so bad:  you get truly awful disk utilization 
(2% of its potential in the last case, ignoring RAID-Z), but you can still read 
ahead through the entire disk set and obtain decent sequential scanning 
performance by reading from all the disks in parallel.  But if your database 
table scan is only one small part of a workload which is (perhaps the worst 
case) performing many other such scans in parallel, your overall system 
throughput will be only around 4% of what it could be had you used 1 MB chunks 
(and the individual scan performances will also suck commensurately, of course).
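
The same toy model (same assumed drive parameters as in the sketch above, with 
an arbitrary 100-disk example) shows where that ~4% comes from:  once many 
concurrent scans keep every disk busy, aggregate throughput is just the number 
of disks times the streaming rate times the per-chunk utilization, so the 
ratio between two chunk sizes is simply the ratio of their utilizations:

  # Saturated aggregate throughput under many parallel scans, using the same
  # assumed (not measured) drive parameters as the previous sketch.

  SEEK_MS, HALF_ROTATION_MS, STREAM_MB_PER_S = 8.0, 4.2, 70.0
  N_DISKS = 100  # arbitrary example system size

  def utilization(chunk_kb):
      transfer_ms = chunk_kb / 1024.0 / STREAM_MB_PER_S * 1000.0
      return transfer_ms / (SEEK_MS + HALF_ROTATION_MS + transfer_ms)

  for kb in (16, 1024):
      aggregate = N_DISKS * STREAM_MB_PER_S * utilization(kb)
      print("%5d KB chunks: ~%d MB/s aggregate across %d disks"
            % (kb, aggregate, N_DISKS))

  # Works out to a few percent, in line with the ~4% (2% / 50%) figure above.
  print("ratio: %.1f%%" % (utilization(16) / utilization(1024) * 100))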

Using 1 MB chunks still spreads out your database admirably for parallel 
random-access throughput:  even if the table is only 1 GB in size (eminently 
cacheable in RAM, should that be preferable), that'll spread it out across 
1,000 disks (2,000, if you mirror it and load-balance to spread out the 
accesses), and much smaller database tables, if they're accessed sufficiently 
heavily for throughput to be an issue, will be wholly cache-resident.  Or 
another way 
to look at it is in terms of how many disks you have in your system:  if it's 
less than the number of MB in your table size, then the table will be spread 
across all of them regardless of what chunk size is used, so you might as well 
use one that's large enough to give you decent sequential scanning performance 
(and if your table is too small to spread across all the disks, then it may 
well all wind up in cache anyway).
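
Put as a trivial calculation (the sizes below are just examples):  the number 
of distinct disks a table can occupy is capped by its chunk count, so once the 
table holds more chunks than you have disks, every disk gets a piece no matter 
what chunk size you pick:

  # How many distinct disks a striped table can be spread across for a given
  # chunk size.  Sizes are arbitrary illustrative examples.

  def disks_covered(table_mb, chunk_mb, n_disks):
      chunks = table_mb // chunk_mb    # chunks the table is divided into
      return min(chunks, n_disks)

  # A 1 GB table in 1 MB chunks can cover on the order of 1,000 disks:
  print(disks_covered(table_mb=1024, chunk_mb=1, n_disks=2000))   # -> 1024
  # With fewer disks than chunks, it covers all of them regardless:
  print(disks_covered(table_mb=1024, chunk_mb=1, n_disks=200))    # -> 200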

ZFS's problem (well, the one specific to this issue, anyway) is that it tries 
to use its 'block size' to cover two different needs:  performance for 
moderately fine-grained updates (though its need to propagate those updates 
upward to the root of the applicable tree significantly compromises this 
effort), and decent disk utilization (I'm using that term to describe 
throughput as a fraction of potential streaming throughput:  just 'keeping the 
disks saturated' only describes where the system hits its throughput wall, not 
how well its design does in pushing that wall back as far as possible).  The 
two requirements conflict, and in ZFS's case the latter one loses - badly.

Which is why background defragmentation could help, as I previously noted:  it 
could rearrange the table such that multiple virtually-sequential ZFS blocks 
were placed contiguously on each disk (to reach 1 MB total, in the current 
example) without affecting the ZFS block size per se.  But every block so 
rearranged (and every tree ancestor of each such block) would then leave an 
equal-sized residue in the most recent snapshot if one existed, which gets 
expensive fast in terms of snapshot space overhead (which then is proportional 
to the amount of reorganization performed as well as to the amount of actual 
data updating).
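
A rough sketch of that proportionality (copy-on-write behaviour assumed; the 
metadata fraction below is a made-up placeholder, not a measured ZFS figure):  
every block the defragmenter rewrites while a snapshot exists leaves its old 
copy, plus its rewritten tree ancestors, pinned by that snapshot:

  # Approximate space retained by the latest snapshot after background
  # defragmentation under copy-on-write.  The metadata fraction standing in
  # for rewritten tree ancestors is an assumed placeholder, not a ZFS number.

  def snapshot_residue_gb(reorganized_gb, metadata_fraction=0.02):
      return reorganized_gb * (1.0 + metadata_fraction)

  # Rearranging 10 GB of table blocks retains roughly 10 GB in the most
  # recent snapshot, over and above what genuine data updates already retain.
  print("%.1f GB retained" % snapshot_residue_gb(10))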

- bill
 
 