Thanks Dimitri!  That was very educational material!  I'm going to think out 
loud here, so please correct me if you see any errors.

The section on tuning for OLTP transactions was interesting, although my OLAP 
workload will be predominantly bulk I/O over large datasets of 
mostly-sequential blocks.

The NFS+ZFS section talked about the zil_disable control for making zfs ignore 
commits/fsyncs.  Given that Postgres' executor does single-threaded synchronous 
I/O like the tar example, it seems like it might benefit significantly from 
setting zil_disable=1, at least in the case of frequently flushed/committed 
writes.  However, zil_disable=1 sounds unsafe for the datafiles' filesystem, 
and would probably only be acceptable for the xlogs if they're stored on a 
separate filesystem and you're willing to lose recently committed 
transactions.  This sounds pretty similar to just setting fsync=off in 
postgresql.conf, which is easier to change later, so I'll skip the zil_disable 
control.

The RAID-Z section was a little surprising.  It made RAID-Z sound just like 
RAID 50, in that you can customize the trade-off between iops versus usable 
diskspace and fault-tolerance by adjusting the number/size of parity-protected 
disk groups.  The only difference I noticed was that RAID-Z will apparently set 
the stripe size across vdevs (RAID-5s) to be as close as possible to the 
filesystem's block size, to maximize the number of disks involved in 
concurrently fetching each block.  Does that sound about right?

So now I'm wondering what RAID-Z offers that RAID-50 doesn't.  I came up with 2 
things: an alleged affinity for full-stripe writes and (under RAID-Z2) the 
added fault-tolerance of RAID-6's second parity block (allowing 2 disks to 
fail per RAID-Z2 group).  It wasn't mentioned in this blog, but I've heard that 
under certain 
circumstances, RAID-Z will magically decide to mirror a block instead of 
calculating parity on it.  I'm not sure how this would happen, and I don't know 
the circumstances that would trigger this behavior, but I think the goal (if it 
really happens) is to avoid the performance penalty of having to read the rest 
of the stripe required to calculate parity.  As far as I know, this is only an 
issue affecting small writes (e.g. single-row updates in an OLTP workload), but 
not large writes (compared to the RAID's stripe size).  Anyway, when I saw the 
filesystem's intent log mentioned, I thought maybe the small writes are 
converted to full-stripe writes by deferring their commit until a full stripe's 
worth of data had been accumulated.  Does that sound plausible?
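
To make the small-write penalty concrete, here is a minimal Python sketch of 
classic RAID-5 style XOR parity.  It is only an illustration of the general 
technique, not a description of how ZFS or RAID-Z computes parity internally; 
the function names and the tiny 4-byte blocks are made up for the example.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_parity(data_blocks):
    # Full-stripe write: every data block is already in memory,
    # so parity can be computed without reading anything from disk.
    return xor_blocks(*data_blocks)


def small_write_parity(old_data, new_data, old_parity):
    # Partial-stripe write: the old data block and the old parity block
    # must be read back first (read-modify-write) before parity is updated.
    return xor_blocks(old_parity, old_data, new_data)


# Hypothetical 3-data-disk stripe with tiny blocks.
stripe = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
parity = full_stripe_parity(stripe)

# Overwrite only the middle block, as a single-row update might.
new_block = bytes([42, 42, 42, 42])
updated_parity = small_write_parity(stripe[1], new_block, parity)
stripe[1] = new_block

# Both routes must agree; the small write just costs extra reads.
assert updated_parity == full_stripe_parity(stripe)
```

The point is that the full-stripe path needs no reads at all, while the 
small-write path needs two reads before it can write, which is the overhead 
that batching writes into full stripes would avoid.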

Are there any other noteworthy perks to RAID-Z compared with RAID-50?  If not, 
I'm inclined to go with your suggestion, Dimitri, and use zfs like RAID-10 to 
stripe a zpool over a bunch of RAID-1 vdevs.  Even though many of our queries 
do mostly sequential I/O, getting higher seeks/second is more important to us 
than the sacrificed diskspace.
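
The capacity-versus-seeks trade-off can be roughed out in Python.  This is a 
back-of-the-envelope model, not a benchmark.  It assumes that every block read 
from a RAID-Z group involves all of the group's disks, so each group delivers 
about one disk's worth of random reads, and that either side of a mirror can 
serve a read.  The 48 disks, 500 GB per disk, and 100 seeks/second per disk are 
illustrative figures only.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    name: str
    vdevs: int
    usable_gb: int
    random_read_iops: int


def mirror_layout(total_disks, disk_gb, disk_iops):
    # Two-way mirrors: half the raw space, but both disks can serve reads.
    vdevs = total_disks // 2
    return Layout("2-way mirrors", vdevs, vdevs * disk_gb, vdevs * 2 * disk_iops)


def raidz_layout(total_disks, group_size, disk_gb, disk_iops, parity=1):
    # Assumption: each group behaves like a single disk for random reads.
    vdevs = total_disks // group_size
    usable_gb = vdevs * (group_size - parity) * disk_gb
    return Layout(f"RAID-Z groups of {group_size}", vdevs, usable_gb,
                  vdevs * disk_iops)


# Illustrative inputs only.
DISKS, DISK_GB, DISK_IOPS = 48, 500, 100

mirrors = mirror_layout(DISKS, DISK_GB, DISK_IOPS)
raidz_small = raidz_layout(DISKS, 6, DISK_GB, DISK_IOPS)
raidz_large = raidz_layout(DISKS, 12, DISK_GB, DISK_IOPS)

for layout in (mirrors, raidz_small, raidz_large):
    print(f"{layout.name:22s} vdevs={layout.vdevs:2d} "
          f"usable={layout.usable_gb:6d} GB  "
          f"random reads/s={layout.random_read_iops}")
```

Under those assumptions, larger RAID-Z groups buy usable space at the cost of 
random-read throughput, while mirrors sit at the opposite end of the scale.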

For the record, those blogs also included a link to a very helpful ZFS Best 
Practices Guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

To sum up, so far the short list of tuning suggestions for ZFS includes:
 - Use a separate zpool and filesystem for xlogs if your apps write often.
 - Consider setting zil_disable=1 on the xlogs' dedicated filesystem.  ZIL is 
the intent log, and it sounds like disabling it may be like disabling 
journaling.  Previous message threads in the Postgres archives debate whether 
this is safe for the xlogs, but it didn't seem like a conclusive answer was 
reached.
 - Make filesystem block size (zfs record size) match the Postgres block size.
 - Manually adjust vdev_cache.  I think this sets the read-ahead size.  It 
defaults to 64 KB.  For an OLTP workload, reduce it; for DW/OLAP, maybe 
increase it.
 - Test various settings for vq_max_pending (until zfs can auto-tune it).  See 
http://blogs.sun.com/erickustarz/entry/vq_max_pending
 - A zpool of mirrored disks should support more seeks/second than RAID-Z, just 
like RAID 10 vs. RAID 50.  However, no single Postgres backend will see better 
than a single disk's seek rate, because the executor currently dispatches only 
1 logical I/O request at a time.
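
Two of the points above lend themselves to quick arithmetic, sketched here in 
Python.  The model is deliberately simple and its inputs are assumptions: an 
8 KB Postgres block, a 128 KB ZFS record size as the untuned case, whole 
records read on every uncached fetch, and a hypothetical 100 seeks/second per 
disk across 48 spindles.

```python
import math


def bytes_read_per_page(pg_block_bytes, zfs_record_bytes):
    # Assumes an aligned, uncached fetch that reads every record it touches.
    records = math.ceil(pg_block_bytes / zfs_record_bytes)
    return records * zfs_record_bytes


def effective_seeks_per_sec(concurrent_requests, spindles, per_disk_seeks):
    # Each outstanding request keeps at most one disk busy, so the pool's
    # seek rate is capped by whichever is smaller: requests or spindles.
    return min(concurrent_requests, spindles) * per_disk_seeks


PG_BLOCK = 8 * 1024

# Read amplification for one random page fetch.
untuned = bytes_read_per_page(PG_BLOCK, 128 * 1024) // PG_BLOCK
matched = bytes_read_per_page(PG_BLOCK, 8 * 1024) // PG_BLOCK
print(f"amplification: untuned={untuned}x matched={matched}x")

# One backend issuing one request at a time vs. many concurrent backends.
for backends in (1, 16, 100):
    rate = effective_seeks_per_sec(backends, spindles=48, per_disk_seeks=100)
    print(f"{backends:3d} backends -> about {rate} seeks/s")
```

In this model a lone backend never gets past a single disk's seek rate, however 
many mirrors are in the pool; the extra spindles only pay off when several 
backends are issuing I/O at once.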


>>> Dimitri <[EMAIL PROTECTED]> 03/23/07 2:28 AM >>>
On Friday 23 March 2007 03:20, Matt Smiley wrote:
> My company is purchasing a Sunfire x4500 to run our most I/O-bound
> databases, and I'd like to get some advice on configuration and tuning. 
> We're currently looking at:
>  - Solaris 10 + zfs + RAID Z
>  - CentOS 4 + xfs + RAID 10
>  - CentOS 4 + ext3 + RAID 10
> but we're open to other suggestions.
>

Matt,

for Solaris + ZFS you may find answers to all your questions here:

  http://blogs.sun.com/roch/category/ZFS
  http://blogs.sun.com/realneel/entry/zfs_and_databases

Remember to measure log (WAL) activity and use a separate pool for logs if 
needed.  Also, RAID-Z is more security-oriented than performance-oriented; 
RAID-10 should be a better choice...

Rgds,
-Dimitri


