Greg Smith wrote:
> Kevin Grittner wrote:
> > I don't know at the protocol level; I just know that write barriers
> > do *something* which causes our controllers to wait for actual disk
> > platter persistence, while fsync does not
>
> It's in the docs now:
> http://www.postgresql.org/docs/9.0/static/wal-reliability.html
>
> FLUSH CACHE EXT is the ATAPI-6 call that filesystems use to enforce
> barriers on that type of drive. Here's what the relevant portion of the
> ATAPI spec says:
>
> "This command is used by the host to request the device to flush the
> write cache. If there is data in the write
> cache, that data shall be written to the media. The BSY bit shall remain
> set to one until all data has been
> successfully written or an error occurs."
>
> SAS systems have a similar call named SYNCHRONIZE CACHE.
>
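
To see one of these flushes issued from userspace: on Linux, calling
fsync() on an opened block-device node is supposed to make the kernel
flush the drive's volatile write cache (FLUSH CACHE EXT on ATA,
SYNCHRONIZE CACHE on SCSI), assuming the kernel and driver support
it.  A minimal sketch; /dev/sda is a placeholder device name:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder device name; run as a user allowed to open it. */
    int fd = open("/dev/sda", O_RDONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /*
     * On a block device node, fsync() asks the kernel to flush the
     * drive's volatile write cache (FLUSH CACHE EXT on ATA,
     * SYNCHRONIZE CACHE on SCSI) where the driver supports it.
     */
    if (fsync(fd) != 0)
        perror("fsync");

    close(fd);
    return 0;
}
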
> The improvement I actually expect to arrive here first is a reliable
> implementation of O_SYNC/O_DSYNC writes. Both SAS and SATA drives that
> are capable of Native Command Queueing support a write type called
> "Force Unit Access", which is essentially just like a direct write that
> cannot be cached. When we get more kernels with reliable sync writing
> that maps under the hood to FUA, and can change wal_sync_method to use
> them, the need to constantly call fsync for every write to the WAL will
> go away. Then the "blow out the RAID cache when barriers are on"
> behavior will only show up during checkpoint fsyncs, which will make
> things a lot better (albeit still not ideal).
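
To make the two styles concrete, here is a minimal sketch contrasting
a plain write plus fsync with a write through an O_DSYNC file
descriptor, roughly the difference between wal_sync_method = fsync
and wal_sync_method = open_datasync.  The file name and block size
are placeholders, and whether O_DSYNC actually turns into an FUA
write depends on the kernel and drive:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCKSZ 8192            /* placeholder WAL block size */

int main(void)
{
    char page[BLOCKSZ];
    int  fd;

    memset(page, 0, sizeof(page));

    /* Style 1: plain write, then an explicit fsync to force it out. */
    fd = open("waltest.tmp", O_WRONLY | O_CREAT, 0600);
    if (fd < 0 || write(fd, page, sizeof(page)) != sizeof(page))
    {
        perror("write");
        return 1;
    }
    if (fsync(fd) != 0)
        perror("fsync");
    close(fd);

    /*
     * Style 2: open with O_DSYNC, so each write() returns only once
     * the data is stable.  On a kernel that maps synchronous writes
     * to FUA, no separate cache-flush round trip is needed.
     */
    fd = open("waltest.tmp", O_WRONLY | O_DSYNC);
    if (fd < 0 || write(fd, page, sizeof(page)) != sizeof(page))
    {
        perror("O_DSYNC write");
        return 1;
    }
    close(fd);
    unlink("waltest.tmp");
    return 0;
}
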
Great information! I have added the attached documentation patch to
explain the write-barrier/BBU interaction. This will appear in the 9.0
documentation.
--
Bruce Momjian <[email protected]> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ None of us is going to be here forever. +
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.66
diff -c -c -r1.66 wal.sgml
*** doc/src/sgml/wal.sgml 13 Apr 2010 14:15:25 -0000 1.66
--- doc/src/sgml/wal.sgml 7 Jul 2010 13:55:58 -0000
***************
*** 48,68 ****
some later time. Such caches can be a reliability hazard because the
memory in the disk controller cache is volatile, and will lose its
contents in a power failure. Better controller cards have
! <firstterm>battery-backed</> caches, meaning the card has a battery that
maintains power to the cache in case of system power loss. After power
is restored the data will be written to the disk drives.
</para>
<para>
And finally, most disk drives have caches. Some are write-through
! while some are write-back, and the
! same concerns about data loss exist for write-back drive caches as
! exist for disk controller caches. Consumer-grade IDE and SATA drives are
! particularly likely to have write-back caches that will not survive a
! power failure, though <acronym>ATAPI-6</> introduced a drive cache
! flush command (FLUSH CACHE EXT) that some file systems use, e.g. <acronym>ZFS</>.
! Many solid-state drives (SSD) also have volatile write-back
! caches, and many do not honor cache flush commands by default.
To check write caching on <productname>Linux</> use
<command>hdparm -I</>; it is enabled if there is a <literal>*</> next
to <literal>Write cache</>; <command>hdparm -W</> to turn off
--- 48,74 ----
some later time. Such caches can be a reliability hazard because the
memory in the disk controller cache is volatile, and will lose its
contents in a power failure. Better controller cards have
! <firstterm>battery-backed unit</> (<acronym>BBU</>) caches, meaning
! the card has a battery that
maintains power to the cache in case of system power loss. After power
is restored the data will be written to the disk drives.
</para>
<para>
And finally, most disk drives have caches. Some are write-through
! while some are write-back, and the same concerns about data loss
! exist for write-back drive caches as exist for disk controller
! caches. Consumer-grade IDE and SATA drives are particularly likely
! to have write-back caches that will not survive a power failure,
! though <acronym>ATAPI-6</> introduced a drive cache flush command
! (<command>FLUSH CACHE EXT</>) that some file systems use, e.g.
! <acronym>ZFS</>, <acronym>ext4</>. (The SCSI command
! <command>SYNCHRONIZE CACHE</> has long been available.) Many
! solid-state drives (SSD) also have volatile write-back caches, and
! many do not honor cache flush commands by default.
! </para>
!
! <para>
To check write caching on <productname>Linux</> use
<command>hdparm -I</>; it is enabled if there is a <literal>*</> next
to <literal>Write cache</>; <command>hdparm -W</> to turn off
***************
*** 83,88 ****
--- 89,113 ----
</para>
<para>
+ Many file systems that use write barriers (e.g. <acronym>ZFS</>,
+ <acronym>ext4</>) internally use <command>FLUSH CACHE EXT</> or
+ <command>SYNCHRONIZE CACHE</> commands to flush data to the platters on
+ write-back-enabled drives. Unfortunately, such write barrier file
+ systems behave suboptimally when combined with battery-backed unit
+ (<acronym>BBU</>) disk controllers. In such setups, the synchronize
+ command forces all data from the BBU to the disks, eliminating much
+ of the benefit of the BBU. You can run the utility
+ <filename>src/tools/fsync</> in the PostgreSQL source tree to see
+ if you are affected. If you are affected, the performance benefits
+ of the BBU cache can be regained by turning off write barriers in
+ the file system or reconfiguring the disk controller, if that is
+ an option. If write barriers are turned off, make sure the battery
+ remains active; a faulty battery can potentially lead to data loss.
+ Hopefully file system and disk controller designers will eventually
+ address this suboptimal behavior.
+ </para>
+
+ <para>
When the operating system sends a write request to the storage hardware,
there is little it can do to make sure the data has arrived at a truly
non-volatile storage area. Rather, it is the
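
The src/tools/fsync utility mentioned in the patch does this kind of
measurement properly; for anyone who wants the idea without building
the source tree, here is a rough standalone approximation.  It times
repeated write-plus-fsync cycles on a scratch file (name, loop count,
and block size are placeholders).  Times on the order of a disk
rotation (several milliseconds per cycle) suggest the flushes are
reaching the platters; sub-millisecond times suggest a write cache,
such as a BBU, is absorbing them:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define LOOPS   100
#define BLOCKSZ 8192            /* placeholder block size */

int main(void)
{
    char           buf[BLOCKSZ];
    struct timeval start, stop;
    int            fd = open("fsync_test.tmp", O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(buf, 0, sizeof(buf));

    gettimeofday(&start, NULL);
    for (int i = 0; i < LOOPS; i++)
    {
        /* Rewrite the same block each time, WAL-style. */
        if (lseek(fd, 0, SEEK_SET) < 0 ||
            write(fd, buf, sizeof(buf)) != sizeof(buf) ||
            fsync(fd) != 0)
        {
            perror("write/fsync");
            return 1;
        }
    }
    gettimeofday(&stop, NULL);

    double usec = (stop.tv_sec - start.tv_sec) * 1e6 +
                  (stop.tv_usec - start.tv_usec);
    printf("%.3f ms per write+fsync cycle\n", usec / LOOPS / 1000.0);

    close(fd);
    unlink("fsync_test.tmp");
    return 0;
}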