Re: [GENERAL] Maximum transaction rate

2009-03-30 Thread Marco Colombo
Markus Wanner wrote:
 Hi,
 
 Martijn van Oosterhout wrote:
 And fsync better do what you're asking
 (how fast is just a performance issue, just as long as it's done).
 
 Where are we on this issue? I've read all of this thread and the one on
 the lvm-linux mailing list as well, but still don't feel confident.
 
 In the following scenario:
 
   fsync - filesystem - physical disk
 
 I'm assuming the filesystem correctly issues a blkdev_issue_flush() on
 the physical disk upon fsync(), to do what it's told: flush the cache(s)
 to disk. Further, I'm also assuming the physical disk is flushable (i.e.
 it correctly implements the blkdev_issue_flush() call). Here we can be
 pretty certain that fsync works as advertised, I think.
 
 The unanswered question to me is, what's happening, if I add LVM in
 between as follows:
 
   fsync - filesystem - device mapper (lvm) - physical disk(s)
 
 Again, assume the filesystem issues a blkdev_issue_flush() to the lower
 layer and the physical disks are all flushable (and implement that
 correctly). How does the device mapper behave?
 
 I'd expect it to forward the blkdev_issue_flush() call to all affected
 devices and only return after the last one has confirmed and completed
 flushing its caches. Is that the case?
 
 I've also read about the newish write barriers and about filesystems
 implementing fsync with such write barriers. That seems fishy to me and
 would of course break in combination with LVM (which doesn't completely
 support write barriers, AFAIU). However, that's clearly the filesystem
 side of the story and has not much to do with whether fsync lies on top
 of LVM or not.
 
 Help in clarifying this issue greatly appreciated.
 
 Kind Regards
 
 Markus Wanner

Well, AFAIK, the summary would be:

1) adding LVM to the chain makes no difference;

2) you still need to disable the write-back cache in IDE/SATA disks,
for fsync() to work properly.

3) without LVM and with write-back cache enabled, due to current(?)
limitations in the linux kernel, with some journaled filesystems
(but not ext3 in data=writeback or data=ordered mode, I'm not sure
about data=journal), you may be less vulnerable if you use fsync()
(or O_SYNC).

"less vulnerable" means that all pending changes are committed to disk,
except possibly the very last one.

So:
- write-back cache + EXT3 = unsafe
- write-back cache + other fs = (depending on the fs)[*] safer but not 100% safe
- write-back cache + LVM + any fs = unsafe
- write-thru cache + any fs = safe
- write-thru cache + LVM + any fs = safe

[*] the fs must use (directly or indirectly via journal commit) a write barrier
on fsync(). Ext3 doesn't (it does when the inode changes, but that happens
once a second only).

If you want both speed and safety, use a battery-backed controller (and
write-thru cache on disks, but the controller should enforce it when you
plug the disks in).
It's the usual Fast, Safe, Cheap: choose two.

This is an interesting article:

http://support.microsoft.com/kb/234656/en-us/

note how for all three kinds of disk (IDE/SATA/SCSI) they say:
"Disk caching should be disabled in order to use the drive with SQL Server."

They don't mention write barriers.

.TM.



Re: [GENERAL] Maximum transaction rate

2009-03-25 Thread Markus Wanner
Hi,

Martijn van Oosterhout wrote:
 And fsync better do what you're asking
 (how fast is just a performance issue, just as long as it's done).

Where are we on this issue? I've read all of this thread and the one on
the lvm-linux mailing list as well, but still don't feel confident.

In the following scenario:

  fsync - filesystem - physical disk

I'm assuming the filesystem correctly issues a blkdev_issue_flush() on
the physical disk upon fsync(), to do what it's told: flush the cache(s)
to disk. Further, I'm also assuming the physical disk is flushable (i.e.
it correctly implements the blkdev_issue_flush() call). Here we can be
pretty certain that fsync works as advertised, I think.

The unanswered question to me is, what's happening, if I add LVM in
between as follows:

  fsync - filesystem - device mapper (lvm) - physical disk(s)

Again, assume the filesystem issues a blkdev_issue_flush() to the lower
layer and the physical disks are all flushable (and implement that
correctly). How does the device mapper behave?

I'd expect it to forward the blkdev_issue_flush() call to all affected
devices and only return after the last one has confirmed and completed
flushing its caches. Is that the case?

I've also read about the newish write barriers and about filesystems
implementing fsync with such write barriers. That seems fishy to me and
would of course break in combination with LVM (which doesn't completely
support write barriers, AFAIU). However, that's clearly the filesystem
side of the story and has not much to do with whether fsync lies on top
of LVM or not.

Help in clarifying this issue greatly appreciated.

Kind Regards

Markus Wanner



Re: [GENERAL] Maximum transaction rate

2009-03-20 Thread Martijn van Oosterhout
On Thu, Mar 19, 2009 at 12:49:52AM +0100, Marco Colombo wrote:
 It has to wait for I/O completion on write(), then, it has to go to
 sleep. If two different processes do a write(), you don't know which
 will be awakened first. Preallocation doesn't mean much here, since with
 O_SYNC you expect a physical write to be done (with the whole sleep/
 HW interrupt/SW interrupt/awake dance). It's true that you may expect
 the writes to be carried out in order, and that might be enough. I'm
 not sure tho.

True, but the relative wakeup order of two different processes is not
important since by definition they are working on different
transactions. As long as the WAL writes for a single transaction (in a
single process) are not reordered you're fine. The benefit of a
non-overwriting storage manager is that you don't need to worry about
undo's. Any incomplete transaction is uncommitted and so any data
produced by that transaction is ignored.

 It may be acceptable or not. Sometimes it's not. Sometimes you must be
 sure the data is on platters before you report "committed". Sometimes
 when you say "fsync!" you mean "I want data flushed to disk NOW, and I
 really mean it!". :)

Of course. Committing a transaction comes down to flipping a single bit.
Before you flip it, all the WAL data for that transaction must have hit
disk. And you don't tell the client the transaction has committed until
the flipped bit has hit disk. And fsync better do what you're asking
(how fast is just a performance issue, just as long as it's done).
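
In userland terms that ordering is just two forced writes. A rough sketch
(purely illustrative, not the actual PostgreSQL code; the WAL fd and the
record layout are made up):

==
/* Sketch only: force the transaction's WAL records to disk, then the
** commit record ("the bit"), and report success only after the second
** fsync() returns. */
#include <unistd.h>

int commit_transaction(int wal_fd, const char *records, size_t len)
{
  if (write(wal_fd, records, len) != (ssize_t) len)
    return -1;
  if (fsync(wal_fd) != 0)              /* WAL data must hit disk first */
    return -1;

  const char commit_rec[] = "COMMIT";  /* the "flipped bit" */
  if (write(wal_fd, commit_rec, sizeof commit_rec) != (ssize_t) sizeof commit_rec)
    return -1;
  if (fsync(wal_fd) != 0)              /* only now tell the client */
    return -1;
  return 0;
}
==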

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.




Re: [GENERAL] Maximum transaction rate

2009-03-20 Thread Marco Colombo
Martijn van Oosterhout wrote:
 True, but the relative wakeup order of two different processes is not
 important since by definition they are working on different
 transactions. As long as the WAL writes for a single transaction (in a
 single process) are not reordered you're fine.

I'm not totally sure, but I think I understand what you mean here:
independent transactions, by definition, don't care about relative ordering.

.TM.



Re: [GENERAL] Maximum transaction rate

2009-03-20 Thread Marco Colombo
Ron Mayer wrote:
 Marco Colombo wrote:
 Yes, but we knew it already, didn't we? It's always been like
 that, with IDE disks and write-back cache enabled, fsync just
 waits for the disk reporting completion and disks lie about
 
 I've looked hard, and I have yet to see a disk that lies.

No, "lie" in the sense that they report completion before the data
hits the platters. Of course, that's the expected behaviour with
write-back caches.

 ext3, OTOH seems to lie.

ext3 simply doesn't know: it interfaces with a block device,
which does the caching (OS level) and the reordering (e.g. elevator
algorithm). ext3 doesn't directly send commands to the disk,
nor does it manage the OS cache.

When software raid and device mapper come into play, you have
virtual block devices built on top of other block devices.

My home desktop has ext3 on top of a dm device (/dev/mapper/something,
a LV set up by LVM in this case), on top of a raid1 device (/dev/mdX),
on top of /dev/sdaX and /dev/sdbX, which, in a way, are themselves
block devices built on others, /dev/sda and /dev/sdb (you don't
actually send commands to partitions, do you? although the mapping
from sector offset relative to the partition to the real sector on
disk is trivial).

Each of these layers potentially caches writes and reorders them; it's
the job of a block device, although it makes sense at most only for
the last one, the one that controls the disk. Anyway, there isn't
much ext3 can do but post write-barrier and flush requests to the block
device at the top of the stack.

 IDE drives happily report whether they support write barriers
 or not, which you can see with the command:
 %hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT

Of course a write barrier is not a cache flush. A flush is
synchronous, a write barrier asynchronous. The disk supports
flushing, not write barriers. Well, technically if you can
control the ordering of the requests, those are proper barriers.
With SCSI you can, IIRC. But a cache flush is, well, a flush.

 Linux kernels since 2005 or so check for this feature.  It'll
 happily tell you which of your devices don't support it.
   %dmesg | grep 'disabling barriers'
   JBD: barrier-based sync failed on md1 - disabling barriers
 And for devices that do, it will happily send IDE FLUSH CACHE
 commands to IDE drives that support the feature.   At the same
 time Linux kernels started sending the very similar SCSI
 SYNCHRONIZE CACHE commands.

 Anyway, it's the block device job to control disk caches. A
 filesystem is just a client to the block device, it posts a
 flush request, what happens depends on the block device code.
 The FS doesn't talk to disks directly. And a write barrier is
 not a flush request, it is a "please do not reorder" request.
 On fsync(), ext3 issues a flush request to the block device,
 that's all it's expected to do.
 
 But AFAICT ext3 fsync() only tells the block device to
 flush disk caches if the inode was changed.

No, ext3 posts a write barrier request when the inode changes and it
commits the journal, which is not a flush. [*]

 Or, at least empirically if I modify a file and do
 fsync(fd); on ext3 it does not wait until the disk
 spun to where it's supposed to spin.   But if I put
 a couple fchmod()'s right before the fsync() it does.

If you were right, and ext3 didn't wait, it would make no
difference on fsync whether the disk cache is enabled or not.
My test shows a 50x speedup when turning the disk cache on.
So for sure ext3 is waiting for the block device to report
completion. It's the block device that - on flush - doesn't
issue a FLUSH command to the disk.

.TM.

[*] A barrier ends up in a FLUSH for the disk, but it doesn't
mean it's synchronous, like a real flush. Even journal updates done
with barriers don't mean "hit the disk now", they just mean "keep
order when writing". If you turn off automatic page cache flushing
and if you have zero memory pressure, a write request with a
barrier may stay forever in the OS cache, at least in theory.

Imagine you don't have bdflush and nothing reclaims resources: days
of activity may stay in RAM, as far as write barriers are concerned.
Now someone types 'sync' as root. The block device starts flushing
dirty pages, reordering writes, but honoring barriers, that is,
it reorders anything up to the first barrier, posts write requests
to the disk, issues a FLUSH command then waits until the flush
is completed. Then consumes the barrier, and starts processing
writes, reordering them up to the next barrier, and so on.
So yes, a barrier turns into a FLUSH command for the disk. But in
this scenario, days have passed since the original write/barrier request
from the filesystem.

Compare with a fsync(). Even in the above scenario, a fsync() should
end up in a FLUSH command to the disk, and wait for the request to
complete, before awakening the process that issued it. So the filesystem
has to request a flush operation to the block device, not a barrier.
And so it does.

If it turns out that the block device just issues writes 

Re: [GENERAL] Maximum transaction rate

2009-03-19 Thread Joshua D. Drake
Hello,

As a continued follow-up to this thread, Tim Post replied on the LVM
list to this effect:


If a logical volume spans physical devices where write caching is
enabled, the results of fsync() can not be trusted. This is an issue
with device mapper, lvm is one of a few possible customers of DM.

Now it gets interesting:

Enter virtualization. When you have something like this:

fsync - guest block device - block tap driver - CLVM - iscsi -
storage - physical disk.

Even if device mapper passed along the write barrier, would it be
reliable? Is every part of that chain going to pass the same along, and
how many opportunities for re-ordering are presented in the above?

So, even if it's fixed in DM, can fsync() still be trusted? I think, at
the least, more testing should be done with various configurations even
after a suitable patch to DM is merged. What about PGSQL users using
some kind of elastic hosting?

Given the craze in 'cloud' technology, its an important question to ask
(and research). 


Cheers,
--Tim


Joshua D. Drake

-- 
PostgreSQL - XMPP: jdr...@jabber.postgresql.org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997




Re: [GENERAL] Maximum transaction rate

2009-03-19 Thread Ron Mayer
Marco Colombo wrote:
 Yes, but we knew it already, didn't we? It's always been like
 that, with IDE disks and write-back cache enabled, fsync just
 waits for the disk reporting completion and disks lie about

I've looked hard, and I have yet to see a disk that lies.

ext3, OTOH seems to lie.

IDE drives happily report whether they support write barriers
or not, which you can see with the command:
%hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT
I've tested about a dozen drives, and I've never seen one that
claims to support flushing but doesn't.  And I haven't seen
one that doesn't support it that was made less than half a
decade ago.  IIRC, ATA-5 specs from 2000 made supporting
this mandatory.

Linux kernels since 2005 or so check for this feature.  It'll
happily tell you which of your devices don't support it.
  %dmesg | grep 'disabling barriers'
  JBD: barrier-based sync failed on md1 - disabling barriers
And for devices that do, it will happily send IDE FLUSH CACHE
commands to IDE drives that support the feature.   At the same
time Linux kernels started sending the very similar SCSI
SYNCHRONIZE CACHE commands.


 Anyway, it's the block device job to control disk caches. A
 filesystem is just a client to the block device, it posts a
 flush request, what happens depends on the block device code.
 The FS doesn't talk to disks directly. And a write barrier is
 not a flush request, it is a "please do not reorder" request.
 On fsync(), ext3 issues a flush request to the block device,
 that's all it's expected to do.

But AFAICT ext3 fsync() only tells the block device to
flush disk caches if the inode was changed.

Or, at least empirically if I modify a file and do
fsync(fd); on ext3 it does not wait until the disk
spun to where it's supposed to spin.   But if I put
a couple fchmod()'s right before the fsync() it does.
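
A quick way to see it is to time a single pwrite()+fsync() with and without
the fchmod() pair. This is a hypothetical sketch along those lines (not the
exact program I used); with an honest flush the fchmod+fsync case should take
around one disk rotation, ~8ms at 7200 RPM:

==
/* Sketch: time pwrite()+fsync(), with and without fchmod() before it. */
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>

static double one_fsync(int fd, int do_chmod)
{
  struct timeval t0, t1;
  char byte = 'x';
  gettimeofday(&t0, NULL);
  pwrite(fd, &byte, 1, 0);
  if (do_chmod) { fchmod(fd, 0644); fchmod(fd, 0664); }
  fsync(fd);
  gettimeofday(&t1, NULL);
  return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
}

int main(int argc, char *argv[])
{
  if (argc < 2) { printf("usage: fstime filename\n"); return 1; }
  int fd = open(argv[1], O_RDWR | O_CREAT, 0666);
  char byte = 'x';
  pwrite(fd, &byte, 1, 0); fsync(fd);   /* warm up: allocate the block */
  printf("fsync alone : %8.0f us\n", one_fsync(fd, 0));
  printf("fchmod+fsync: %8.0f us\n", one_fsync(fd, 1));
  return 0;
}
==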



Re: [GENERAL] Maximum transaction rate

2009-03-19 Thread Baron Schwartz
I am jumping into this thread late, and maybe this has already been
stated clearly, but from my experience benchmarking, LVM does *not*
lie about fsync() on the servers I've configured.  An fsync() goes to
the physical device.  You can see it clearly by setting the write
cache on the RAID controller to write-through policy.  Performance
decreases to what the disks can do.

And my colleagues and clients have tested yanking the power plug and
checking that the data got to the RAID controller's battery-backed
cache, many many times.  In other words, the data is safe and durable,
even on LVM.

However, I have never tried to do this on volumes that span multiple
physical devices, because LVM can't take an atomic snapshot across
them, which completely negates the benefit of LVM for my purposes.  So
I always create one logical disk in the RAID controller, and then
carve that up with LVM, partitions, etc however I please.

I almost surely know less about this topic than anyone on this thread.

Baron



Re: [GENERAL] Maximum transaction rate

2009-03-18 Thread Ron Mayer
Marco Colombo wrote:
 Ron Mayer wrote:
 Greg Smith wrote:
 There are some known limitations to Linux fsync that I remain somewhat
 concerned about, independantly of LVM, like ext3 fsync() only does a
 journal commit when the inode has changed (see
 http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 )
 I wonder if there should be an optional fsync mode
 in postgres that would turn fsync() into
 fchmod (fd, 0644); fchmod (fd, 0664);
'course I meant: fchmod (fd, 0644); fchmod (fd, 0664); fsync(fd);
 to work around this issue.
 
 Question is... why do you care if the journal is not flushed on fsync?
 Only the file data blocks need to be, if the inode is unchanged.

You don't - but ext3 fsync won't even push the file data blocks
through a disk cache unless the inode was changed.

The point is that ext3 only does the write barrier processing
that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE (SCSI)
commands on inode changes, not data changes.   And with no FLUSH
CACHE or SYNCHRONIZE CACHE the data blocks may sit in the disk's
cache after the fsync() as well.

PS: not sure if this is still true - last time I tested it
was nov 2006.

   Ron



Re: [GENERAL] Maximum transaction rate

2009-03-18 Thread Marco Colombo
Greg Smith wrote:
 On Wed, 18 Mar 2009, Marco Colombo wrote:
 
 If you fsync() after each write you want ordered, there can't be any
 subsequent I/O (unless there are many different processes
 concurrently writing to the file w/o synchronization).
 
 Inside PostgreSQL, each of the database backend processes ends up
 writing blocks to the database disk, if they need to allocate a new
 buffer and the one they are handed is dirty.  You can easily have
 several of those writing to the same 1GB underlying file on disk.  So
 that prerequisite is there.  The main potential for a problem here would
 be if a stray unsynchronized write from one of those backends happened
 in a way that wasn't accounted for by the WAL+checkpoint design.

Wow, that would be quite a bug. That's why I wrote "w/o synchronization".
"stray + unaccounted + concurrent" smells like the recipe for an
explosive to me :)

 What I
 was suggesting is that the way that synchronization happens in the
 database provides some defense from running into problems in this area.

I hope it's full defence. If you have two processes doing at the
same time write(); fsync(); on the same file, either there are no order
requirements, or it will boom sooner or later... fsync() works inside
a single process, but any system call may put the process to sleep, and
who knows when it will be awakened and what other processes did to that
file meanwhile. I'm pretty confident that PG code protects access to
shared resources with synchronization primitives.

Anyway I was referring to WAL writes... due to the nature of a log,
it's hard to think of many unordered writes and of concurrent access
w/o synchronization. But inside a critical region, there can be more
than one single write, and you may need to enforce an order, but no
more than that before the final fsync(). If so, userland-originated
barriers instead of full fsync()'s may help with performance.
But I'm speculating.

 The way backends handle writes themselves is also why your suggestion
 about the database being able to utilize barriers isn't really helpful.
 Those trickle out all the time, and normally you don't even have to care
 about ordering them.  The only time you do need to care is at checkpoint
 time, and there only a hard line is really practical--all writes up to that
 point, period. Trying to implement ordered writes for everything that happened
 before then would complicate the code base, which isn't going to happen
 for such a platform+filesystem specific feature, one that really doesn't
 offer much acceleration from the database's perspective.

I don't know the internals of WAL writing, I can't really reply on that.

 only when the journal wraps around there's a (extremely) small window
 of vulnerability. You need to write a careful crafted torture program
 to get any chance to observe that... such program exists, and triggers
 the problem
 
 Yeah, I've been following all that.  The PostgreSQL WAL design works on
 ext2 filesystems with no journal at all.  Some people even put their
 pg_xlog directory onto ext2 filesystems for best performance, relying on
 the WAL to be the journal.  As long as fsync is honored correctly, the
 WAL writes should be re-writing already allocated space, which makes
 this category of journal mayhem not so much of a problem.  But when I
 read about fsync doing unexpected things, that gets me more concerned.

Well, that's highly dependent on your expectations :) I don't expect
an fsync to trigger a journal commit, if metadata hasn't changed. That's
obviously true for metadata-only journals (like most of them, with the
notable exception of ext3 in data=journal mode).

Yet, if you're referring to this
http://article.gmane.org/gmane.linux.file-systems/21373

well, that seems to me the same usual thing/bug: fsync() allows disks to
lie when it comes to caching writes. Nothing new under the sun.

Barriers don't change much, because they don't replace a flush. They're
about consistency, not durability. So even with full barrier support,
an fsync implementation needs to end up in a disk cache flush, to be fully
compliant with its own semantics.

.TM.



Re: [GENERAL] Maximum transaction rate

2009-03-18 Thread Martijn van Oosterhout
On Wed, Mar 18, 2009 at 10:58:39PM +0100, Marco Colombo wrote:
 I hope it's full defence. If you have two processes doing at the
 same time write(); fsync(); on the same file, either there are no order
 requirements, or it will boom sooner or later... fsync() works inside
 a single process, but any system call may put the process to sleep, and
 who knows when it will be awakened and what other processes did to that
 file meanwhile. I'm pretty confident that PG code protects access to
 shared resources with synchronization primitives.

Generally PG uses O_SYNC on open, so it's only one system call, not
two. And the file it's writing to is generally preallocated (not
always though).
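
Roughly like this (an illustrative sketch only, not the actual PostgreSQL code):

==
/* Two ways to get a synchronous write of one record: with O_SYNC the
** wait happens inside write() itself; otherwise a separate fsync()
** call is needed after the write(). */
#include <fcntl.h>
#include <unistd.h>

void write_record(const char *path, const char *buf, size_t len)
{
  /* one system call per record */
  int fd1 = open(path, O_WRONLY | O_CREAT | O_SYNC, 0666);
  write(fd1, buf, len);   /* returns only once the I/O is done */
  close(fd1);

  /* two system calls per record */
  int fd2 = open(path, O_WRONLY | O_CREAT, 0666);
  write(fd2, buf, len);   /* may complete from the page cache */
  fsync(fd2);             /* explicit wait for the device */
  close(fd2);
}
==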

 Well, that's highly dependent on your expectations :) I don't expect
 an fsync to trigger a journal commit, if metadata hasn't changed. That's
 obviously true for metadata-only journals (like most of them, with the
 notable exception of ext3 in data=journal mode).

Really the only thing needed is that the WAL entry reaches disk before
the actual data does. AIUI as long as you have that the situation is
recoverable. Given that the actual data probably won't be written for a
while it'd need to go pretty wonky before you see an issue.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.




Re: [GENERAL] Maximum transaction rate

2009-03-18 Thread Marco Colombo
Ron Mayer wrote:
 Marco Colombo wrote:
 Ron Mayer wrote:
 Greg Smith wrote:
 There are some known limitations to Linux fsync that I remain somewhat
 concerned about, independantly of LVM, like ext3 fsync() only does a
 journal commit when the inode has changed (see
 http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 )
 I wonder if there should be an optional fsync mode
 in postgres that would turn fsync() into
 fchmod (fd, 0644); fchmod (fd, 0664);
 'course I meant: fchmod (fd, 0644); fchmod (fd, 0664); fsync(fd);
 to work around this issue.
 Question is... why do you care if the journal is not flushed on fsync?
 Only the file data blocks need to be, if the inode is unchanged.
 
 You don't - but ext3 fsync won't even push the file data blocks
 through a disk cache unless the inode was changed.
 
 The point is that ext3 only does the write barrier processing
 that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE (SCSI)
 commands on inode changes, not data changes.   And with no FLUSH
 CACHE or SYNCHRONIZE CACHE the data blocks may sit in the disk's
 cache after the fsync() as well.

Yes, but we knew it already, didn't we? It's always been like
that, with IDE disks and write-back cache enabled, fsync just
waits for the disk reporting completion and disks lie about
that. Write barriers enforce ordering: WHEN writes are
committed to disk, they will be in order, but that doesn't mean
NOW. Ordering is enough for an FS journal, the only requirement
is consistency.

Anyway, it's the block device job to control disk caches. A
filesystem is just a client to the block device, it posts a
flush request, what happens depends on the block device code.
The FS doesn't talk to disks directly. And a write barrier is
not a flush request, it is a "please do not reorder" request.
On fsync(), ext3 issues a flush request to the block device,
that's all it's expected to do.

Now, some block devices may implement write barriers by issuing
FLUSH commands to the disk, but that's another matter. A FS
shouldn't rely on that.

You can replace a barrier with a flush (not as efficiently),
but not the other way around.

If a block device driver issues FLUSH for a barrier, and
doesn't issue a FLUSH for a flush, well, it's a buggy driver,
IMHO.

.TM.



Re: [GENERAL] Maximum transaction rate

2009-03-18 Thread Greg Smith

On Wed, 18 Mar 2009, Martijn van Oosterhout wrote:


Generally PG uses O_SYNC on open


Only if you change wal_sync_method=open_sync.  That's the very last option 
PostgreSQL will try--only if none of the others are available will it use 
that.


Last time I checked, the default value for that parameter broke down like 
this by platform:


open_datasync (O_DSYNC):  Solaris, Windows (I think there's a PG wrapper 
involved for Win32)


fdatasync:  Linux (even though the OS just provides a fake wrapper around 
fsync for that call)


fsync_writethrough:  Mac OS X

fsync:  FreeBSD

That makes Solaris the only UNIX{-ish} OS where the default is a genuine 
sync write.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: [GENERAL] Maximum transaction rate

2009-03-18 Thread Marco Colombo
Martijn van Oosterhout wrote:
 Generally PG uses O_SYNC on open, so it's only one system call, not
 two. And the file it's writing to is generally preallocated (not
 always though).

It has to wait for I/O completion on write(), then, it has to go to
sleep. If two different processes do a write(), you don't know which
will be awakened first. Preallocation doesn't mean much here, since with
O_SYNC you expect a physical write to be done (with the whole sleep/
HW interrupt/SW interrupt/awake dance). It's true that you may expect
the writes to be carried out in order, and that might be enough. I'm
not sure tho.

 Well, that's highly dependent on your expectations :) I don't expect
 an fsync to trigger a journal commit, if metadata hasn't changed. That's
 obviously true for metadata-only journals (like most of them, with the
 notable exception of ext3 in data=journal mode).
 
 Really the only thing needed is that the WAL entry reaches disk before
 the actual data does. AIUI as long as you have that the situation is
 recoverable. Given that the actual data probably won't be written for a
 while it'd need to go pretty wonky before you see an issue.

You're giving up Durability here. In a closed system, that doesn't mean
much, but when you report "payment accepted" to third parties, you can't
forget about it later. The requirement you stated is for Consistency only.
That's what a journaled FS cares about, i.e. no need for fsck (internal
consistency checks) after a crash. It may be acceptable for a remote
standby backup: you replay as much of the WAL as is available after
the crash (the part you managed to copy, that is). But you know there
can be lost transactions.

It may be acceptable or not. Sometimes it's not. Sometimes you must be
sure the data is on platters before you report "committed". Sometimes
when you say "fsync!" you mean "I want data flushed to disk NOW, and I
really mean it!". :)

.TM.



Re: [GENERAL] Maximum transaction rate

2009-03-17 Thread Marco Colombo
John R Pierce wrote:
 Stefan Kaltenbrunner wrote:
 So in my understanding LVM is safe on disks that have write cache
 disabled or behave as one (like a controller with a battery backed
 cache).
 
 what about drive write caches on battery backed raid controllers?  do
 the controllers ensure the drive cache gets flushed prior to releasing
 the cached write blocks ?

If LVM/dm is lying about fsync(), all this is moot. There's no point
talking about disk caches.

BTW. This discussion is continuing on the linux-lvm mailing list.
https://www.redhat.com/archives/linux-lvm/2009-March/msg00025.html
I have some PG databases on LVM systems, so I need to know for sure
whether I have to move them elsewhere. It seemed to me the right place
for asking about the issue.

Someone there pointed out that fsync() is not LVM's responsibility.

Correct. For sure, there's an API (or more than one) a filesystem uses
to force a flush on the underlying block device, and for sure it has to
be called while inside the fsync() system call.

So "lying to fsync()" is maybe more correct than "lying about fsync()".

.TM.



Re: [GENERAL] Maximum transaction rate

2009-03-17 Thread Greg Smith

On Tue, 17 Mar 2009, Marco Colombo wrote:


If LVM/dm is lying about fsync(), all this is moot. There's no point
talking about disk caches.


I decided to run some tests to see what's going on there, and it looks 
like some of my quick criticism of LVM might not actually be valid--it's 
only the performance that is problematic, not necessarily the reliability. 
Appears to support fsync just fine.  I tested with kernel 2.6.22, so 
certainly not before the recent changes to LVM behavior improving this 
area, but with the bugs around here from earlier kernels squashed (like 
crummy HPA support circa 2.6.18-2.6.19, see 
https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/82314 )


You can do a quick test of fsync rate using sysbench; got the idea from 
http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/

(their command has some typos, fixed one below)

If fsync is working properly, you'll get something near the rotation rate of 
the disk (RPM/60).  If it's lying, you'll see a much higher number.


I couldn't get the current sysbench-0.4.11 to compile (bunch of X 
complaints from libtool), but the old 0.4.8 I had around still works fine. 
Let's start with a regular ext3 volume.  Here's what I see against a 7200 
RPM disk (=120 rotations/second) with the default caching turned on:


$ alias fsynctest="~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1 
--file-num=1 --file-total-size=16384 --file-test-mode=rndwr run | grep 
\"Requests/sec\""
$ fsynctest
 6469.36 Requests/sec executed

That's clearly lying as expected (and I ran all these a couple of times, 
just reporting one for brevity sake; snipped some other redundant stuff 
too).  I followed the suggestions at 
http://www.postgresql.org/docs/current/static/wal-reliability.html to turn 
off the cache and tested again:


$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
   *Write cache
$ sudo /sbin/hdparm -W0 /dev/sdf

/dev/sdf:
 setting drive write-caching to 0 (off)
$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
Write cache
$ fsynctest
  106.05 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
 6469.36 Requests/sec executed

Great:  I was expecting ~120 commits/sec from a 7200 RPM disk, that's what 
I get when caching is off.


Now, let's switch to using a LVM volume on a different partition of 
that disk, and run the same test to see if anything changes.


$ sudo mount /dev/lvmvol/lvmtest /mnt/
$ cd /mnt/test
$ fsynctest
 6502.67 Requests/sec executed
$ sudo /sbin/hdparm -W0 /dev/sdf
$ fsynctest
  112.78 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
 6499.11 Requests/sec executed

Based on this test, it looks to me like fsync works fine on LVM.  It must 
be passing that down to the physical disk correctly or I'd still be seeing 
inflated rates.  If you've got a physical disk that lies about fsync, and 
you put a database on it, you're screwed whether or not you use LVM; 
nothing different on LVM than in the regular case.  A battery-backed 
caching controller should also handle fsync fine if it turns off the 
physical disk cache, which most of them do--and, again, you're no more or 
less exposed to that particular problem with LVM than a regular 
filesystem.


The thing that barriers helps out with is that it makes it possible to 
optimize flushing ext3 journal metadata when combined with hard drives 
that support the appropriate cache flushing mechanism (what hdparm calls 
FLUSH CACHE EXT; see 
http://forums.opensuse.org/archives/sls-archives/archives-suse-linux/archives-desktop-environments/379681-barrier-sync.html 
).  That way you can prioritize flushing just the metadata needed to 
prevent filesystem corruption while still fully caching less critical 
regular old writes.  In that situation, performance could be greatly 
improved over turning off caching altogether.  However, in the PostgreSQL 
case, the fsync hammer doesn't appreciate this optimization anyway--all 
the database writes are going to get forced out by that no matter what 
before the database considers them reliable.  Proper barriers support 
might be helpful in the case where you're using a database on a shared 
disk that has other files being written to as well, basically allowing 
caching on those while forcing the database blocks to physical disk, but 
that presumes the Linux fsync implementation is more sophisticated than I 
believe it currently is.


Far as I can tell, the main open question I didn't directly test here is 
whether LVM does any write reordering that can impact database use because 
it doesn't handle write barriers properly.  According to 
https://www.redhat.com/archives/linux-lvm/2009-March/msg00026.html it does 
not, and I never got the impression that was impacted by the LVM layer 
before.  The concern is nicely summarized by the comment from Xman at 
http://lwn.net/Articles/283161/ :


fsync will block until the outstanding requests have been sync'd to disk, 

Re: [GENERAL] Maximum transaction rate

2009-03-17 Thread Ron Mayer
Greg Smith wrote:
 There are some known limitations to Linux fsync that I remain somewhat
 concerned about, independantly of LVM, like ext3 fsync() only does a
 journal commit when the inode has changed (see
 http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ).  The
 way files are preallocated, the PostgreSQL WAL is supposed to function
 just fine even if you're using fdatasync after WAL writes, which also
 wouldn't touch the journal (last time I checked fdatasync was
 implemented as a full fsync on Linux).  Since the new ext4 is more

Indeed it does.

I wonder if there should be an optional fsync mode
in postgres that would turn fsync() into
fchmod (fd, 0644); fchmod (fd, 0664);
to work around this issue.

For example this program below will show one write
per disk revolution if you leave the fchmod() in there,
and run many times faster (i.e. lying) if you remove it.
This is with ext3 on a standard IDE drive with the write
cache enabled, and no LVM or anything between them.

==
/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc,char *argv[]) {
  if (argc<2) {
    printf("usage: fs filename\n");
    exit(1);
  }
  int fd = open (argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
  int i;
  for (i=0;i<100;i++) {
    char byte;
    pwrite (fd, &byte, 1, 0);
    fchmod (fd, 0644); fchmod (fd, 0664);
    fsync (fd);
  }
}
==




Re: [GENERAL] Maximum transaction rate

2009-03-17 Thread Greg Smith

On Tue, 17 Mar 2009, Ron Mayer wrote:


I wonder if there should be an optional fsync mode
in postgres that would turn fsync() into
   fchmod (fd, 0644); fchmod (fd, 0664);
to work around this issue.


The test I haven't had time to run yet is to turn the bug exposing program 
you were fiddling with into a more accurate representation of WAL 
activity, to see if that chmod still changes the behavior there. I think 
the most dangerous possibility here is if you create a new WAL segment and 
immediately fill it, all in less than a second.  Basically, what 
XLogFileInit does:


-Open with O_RDWR | O_CREAT | O_EXCL
-Write XLogSegSize (16MB) worth of zeros
-fsync

Followed by simulating what XLogWrite would do if you fed it enough data 
to force a segment change:


-Write a new 16MB worth of data
-fsync
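
Strung together, a rough sketch of that sequence (made-up file name, not
actual backend code) would be:

==
/* Sketch of the two steps above: create and zero-fill a 16MB "segment",
** fsync it, then immediately overwrite it with data and fsync again. */
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

#define SEGSIZE (16 * 1024 * 1024)   /* XLogSegSize */

int main(void)
{
  char *buf = malloc(SEGSIZE);

  /* what XLogFileInit does */
  int fd = open("fake_segment", O_RDWR | O_CREAT | O_EXCL, 0600);
  memset(buf, 0, SEGSIZE);
  write(fd, buf, SEGSIZE);
  fsync(fd);

  /* what XLogWrite would do on an immediate segment fill */
  memset(buf, 'x', SEGSIZE);
  pwrite(fd, buf, SEGSIZE, 0);
  fsync(fd);

  close(fd);
  free(buf);
  return 0;
}
==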

If you did all that in under a second, would you still get a filesystem 
flush each time?  From the description of the problem I'm not so sure 
anymore.  I think that's how tight the window would have to be for this 
issue to show up right now, you'd only be exposed if you filled a new WAL 
segment faster than the associated journal commit happened (basically, a 
crash when WAL write volume > 16MB/s in a situation where new segments are 
being created).  But from what I've read about ext4 I think that window 
for mayhem might widen on that filesystem--that's what got me reading up 
on this whole subject recently, before this thread even started.


The other ameliorating factor here is that in order for this to bite you, 
I think you'd need to have another, incorrectly ordered write somewhere 
else that could happen before the delayed write.  Not sure where that 
might be possible in the PostgreSQL WAL implementation yet.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: [GENERAL] Maximum transaction rate

2009-03-17 Thread Marco Colombo
Greg Smith wrote:
 On Tue, 17 Mar 2009, Marco Colombo wrote:
 
 If LVM/dm is lying about fsync(), all this is moot. There's no point
 talking about disk caches.
 
 I decided to run some tests to see what's going on there, and it looks
 like some of my quick criticism of LVM might not actually be valid--it's
 only the performance that is problematic, not necessarily the
 reliability. Appears to support fsync just fine.  I tested with kernel
 2.6.22, so certainly not before the recent changes to LVM behavior
 improving this area, but with the bugs around here from earlier kernels
 squashed (like crummy HPA support circa 2.6.18-2.6.19, see
 https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/82314 )

I've run tests too, you can see them here:
https://www.redhat.com/archives/linux-lvm/2009-March/msg00055.html
in case you're looking for something trivial (a write/fsync loop).
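
Something along these lines, just measuring the raw fsync rate directly (a
minimal reconstruction for illustration, not necessarily the exact program
posted there):

==
/* Minimal fsync-rate counter: on an honest stack the result should be
** close to the disk's rotation rate (7200 RPM = 120/s). */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
  int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0666);
  struct timeval t0, t1;
  char byte = 'x';
  int i, n = 500;

  gettimeofday(&t0, NULL);
  for (i = 0; i < n; i++) {
    pwrite(fd, &byte, 1, 0);
    fsync(fd);
  }
  gettimeofday(&t1, NULL);

  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
  printf("%.0f fsyncs/sec\n", n / secs);
  return 0;
}
==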

 You can do a quick test of fsync rate using sysbench; got the idea from
 http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/
 (their command has some typos, fixed one below)
 
 If fsync is working properly, you'll get something near the rotation rate of
 the disk (RPM/60).  If it's lying, you'll see a much higher number.

Same results. -W1 gives a 50x speedup; it must be waiting for something
at disk level with -W0.

[...]

 Based on this test, it looks to me like fsync works fine on LVM.  It
 must be passing that down to the physical disk correctly or I'd still be
 seeing inflated rates.  If you've got a physical disk that lies about
 fsync, and you put a database on it, you're screwed whether or not you
 use LVM; nothing different on LVM than in the regular case.  A
 battery-backed caching controller should also handle fsync fine if it
 turns off the physical disk cache, which most of them do--and, again,
 you're no more or less exposed to that particular problem with LVM than
 a regular filesystem.

That was my initial understanding.

 The thing that barriers helps out with is that it makes it possible to
 optimize flushing ext3 journal metadata when combined with hard drives
 that support the appropriate cache flushing mechanism (what hdparm calls
 FLUSH CACHE EXT; see
 http://forums.opensuse.org/archives/sls-archives/archives-suse-linux/archives-desktop-environments/379681-barrier-sync.html
 ).  That way you can prioritize flushing just the metadata needed to
 prevent filesystem corruption while still fully caching less critical
 regular old writes.  In that situation, performance could be greatly
 improved over turning off caching altogether.  However, in the
 PostgreSQL case, the fsync hammer doesn't appreciate this optimization
 anyway--all the database writes are going to get forced out by that no
 matter what before the database considers them reliable.  Proper
 barriers support might be helpful in the case where you're using a
 database on a shared disk that has other files being written to as well,
 basically allowing caching on those while forcing the database blocks to
 physical disk, but that presumes the Linux fsync implementation is more
 sophisticated than I believe it currently is.

This is the same conclusion I came to. Moreover, once you have barriers
passed down to the disks, it would be nice to have a userland API to send
them to the kernel. Any application managing a 'journal' or 'log' type
of object would benefit from that. I'm not familiar with PG internals,
but it's likely you can have some records you just want to be ordered, and
you can do something like write-barrier-write-barrier-...-fsync instead of
write-fsync-write-fsync-... Currently fsync() (and friends, O_SYNC,
fdatasync(), O_DSYNC) is the only way to enforce ordering on writes
from userland.
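
In other words, today the only portable pattern for two ordered records is
the one sketched below; the userland barrier call I'd like is hypothetical,
it doesn't exist:

==
/* What we have today: a full flush between two records just to keep
** them ordered.  A (hypothetical, non-existent) userland barrier
** between the writes would keep the order without waiting. */
#include <unistd.h>

void two_ordered_records(int fd, const char *a, size_t la,
                                 const char *b, size_t lb)
{
  write(fd, a, la);
  fsync(fd);        /* only here to order a before b */
  write(fd, b, lb);
  fsync(fd);        /* the flush we actually want, for durability */
}
==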

 Far as I can tell, the main open question I didn't directly test here is
 whether LVM does any write reordering that can impact database use
 because it doesn't handle write barriers properly.  According to
 https://www.redhat.com/archives/linux-lvm/2009-March/msg00026.html it
 does not, and I never got the impression that was impacted by the LVM
 layer before.  The concern is nicely summarized by the comment from Xman
 at http://lwn.net/Articles/283161/ :
 
 fsync will block until the outstanding requests have been sync'd to
 disk, but it doesn't guarantee that subsequent I/O's to the same fd
 won't potentially also get completed, and potentially ahead of the I/O's
 submitted prior to the fsync. In fact it can't make such guarantees
 without functioning barriers.

Sure, but from userland you can't set barriers. If you fsync() after each
write you want ordered, there can't be any subsequent I/O (unless
there are many different processes concurrently writing to the file
w/o synchronization).

 Since we know LVM does not have functioning barriers, this would seem to
 be one area where PostgreSQL would be vulnerable.  But since ext3
 doesn't have barriers turned on by default either (except some recent SuSE
 systems), it's not unique to a LVM setup, 

Re: [GENERAL] Maximum transaction rate

2009-03-17 Thread Marco Colombo
Ron Mayer wrote:
 Greg Smith wrote:
 There are some known limitations to Linux fsync that I remain somewhat
 concerned about, independantly of LVM, like ext3 fsync() only does a
 journal commit when the inode has changed (see
 http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ).  The
 way files are preallocated, the PostgreSQL WAL is supposed to function
 just fine even if you're using fdatasync after WAL writes, which also
 wouldn't touch the journal (last time I checked fdatasync was
 implemented as a full fsync on Linux).  Since the new ext4 is more
 
 Indeed it does.
 
 I wonder if there should be an optional fsync mode
 in postgres that would turn fsync() into
 fchmod (fd, 0644); fchmod (fd, 0664);
 to work around this issue.

Question is... why do you care if the journal is not flushed on fsync?
Only the file data blocks need to be, if the inode is unchanged.

 For example this program below will show one write
 per disk revolution if you leave the fchmod() in there,
 and run many times faster (i.e. lying) if you remove it.
 This with ext3 on a standard IDE drive with the write
 cache enabled, and no LVM or anything between them.
 
 ==
 /*
 ** based on http://article.gmane.org/gmane.linux.file-systems/21373
 ** http://thread.gmane.org/gmane.linux.kernel/646040
 */
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <stdio.h>
 #include <stdlib.h>
 
 int main(int argc,char *argv[]) {
   if (argc<2) {
     printf("usage: fs filename\n");
     exit(1);
   }
   int fd = open (argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
   int i;
   for (i=0;i<100;i++) {
     char byte;
     pwrite (fd, &byte, 1, 0);
     fchmod (fd, 0644); fchmod (fd, 0664);
     fsync (fd);
   }
 }
 ==
 

I ran the program above, w/o the fchmod()s.

$ time ./test2 testfile

real0m0.056s
user0m0.001s
sys 0m0.008s

This is with ext3+LVM+raid1+sata disks with hdparm -W1.
With -W0 I get:

$ time ./test2 testfile

real0m1.014s
user0m0.000s
sys 0m0.008s

Big difference. The fsync() there does its job.

The same program runs with a 3x slowdown with the fchmod()s, but that's
expected: it's doing twice the writes, and in different places.

.TM.



Re: [GENERAL] Maximum transaction rate

2009-03-17 Thread Greg Smith

On Wed, 18 Mar 2009, Marco Colombo wrote:

If you fsync() after each write you want ordered, there can't be any 
subsequent I/O (unless there are many different processes concurrently 
writing to the file w/o synchronization).


Inside PostgreSQL, each of the database backend processes ends up writing 
blocks to the database disk, if they need to allocate a new buffer and the 
one they are handed is dirty.  You can easily have several of those 
writing to the same 1GB underlying file on disk.  So that prerequisite is 
there.  The main potential for a problem here would be if a stray 
unsynchronized write from one of those backends happened in a way that 
wasn't accounted for by the WAL+checkpoint design.  What I was suggesting 
is that the way that synchronization happens in the database provides some 
defense from running into problems in this area.


The way backends handle writes themselves is also why your suggestion 
about the database being able to utilize barriers isn't really helpful. 
Those trickle out all the time, and normally you don't even have to care 
about ordering them.  The only time you do need to care is at checkpoint time, 
and there only a hard line is really practical--all writes up to that point, period. 
Trying to implement ordered writes for everything that happened before 
then would complicate the code base, which isn't going to happen for such 
a platform+filesystem specific feature, one that really doesn't offer much 
acceleration from the database's perspective.


only when the journal wraps around there's a (extremely) small window of 
vulnerability. You need to write a careful crafted torture program to 
get any chance to observe that... such program exists, and triggers the 
problem


Yeah, I've been following all that.  The PostgreSQL WAL design works on 
ext2 filesystems with no journal at all.  Some people even put their 
pg_xlog directory onto ext2 filesystems for best performance, relying on 
the WAL to be the journal.  As long as fsync is honored correctly, the WAL 
writes should be re-writing already allocated space, which makes this 
category of journal mayhem not so much of a problem.  But when I read 
about fsync doing unexpected things, that gets me more concerned.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: [GENERAL] Maximum transaction rate

2009-03-16 Thread Stefan Kaltenbrunner

Tom Lane wrote:

Jack Orenstein jack.orenst...@hds.com writes:

The transaction rates I'm getting seem way too high: 2800-2900 with
one thread, 5000-7000 with ten threads. I'm guessing that writes
aren't really reaching the disk. Can someone suggest how to figure out
where, below postgres, someone is lying about writes reaching the
disk?


AFAIK there are two trouble sources in recent Linux machines: LVM and
the disk drive itself.  LVM is apparently broken by design --- it simply
fails to pass fsync requests.  If you're using it you have to stop.
(Which sucks, because it's exactly the kind of thing DBAs tend to want.)
Otherwise you need to reconfigure your drive to not cache writes.
I forget the incantation for that but it's in the PG list archives.


hmm are you sure this is what is happening?
In my understanding LVM is not passing down barriers (generally - it 
seems to do so in some limited circumstances) which means in my 
understanding it is not safe on any storage drive that has write cache 
enabled. This seems to be the very same issue Linux had for ages 
before ext3 got barrier support (not sure if even today all filesystems 
do have that).
So in my understanding LVM is safe on disks that have write cache 
disabled or behave as one (like a controller with a battery backed cache).
For storage with write caches it seems to be unsafe, even if the 
filesystem supports barriers and it has them enabled (which I don't 
think all have) which is basically what all of linux was not too long ago.



Stefan



Re: [GENERAL] Maximum transaction rate

2009-03-16 Thread Scott Marlowe
On Mon, Mar 16, 2009 at 2:03 PM, Stefan Kaltenbrunner
ste...@kaltenbrunner.cc wrote:
 So in my understanding LVM is safe on disks that have write cache disabled
 or behave as one (like a controller with a battery backed cache).
 For storage with write caches it seems to be unsafe, even if the filesystem
 supports barriers and it has them enabled (which I don't think all have)
 which is basically what all of linux was not too long ago.

I definitely didn't have this problem with SCSI drives directly
attached to a machine under pgsql on ext2 back in the day (way back,
like 5 to 10 years ago).  IDE / PATA drives, on the other hand,
definitely suffered with having write caches enabled.



Re: [GENERAL] Maximum transaction rate

2009-03-16 Thread John R Pierce

Stefan Kaltenbrunner wrote:
So in my understanding LVM is safe on disks that have write cache 
disabled or behave as one (like a controller with a battery backed 
cache).


what about drive write caches on battery backed raid controllers?  do 
the controllers ensure the drive cache gets flushed prior to releasing 
the cached write blocks ?






Re: [GENERAL] Maximum transaction rate

2009-03-16 Thread Stefan Kaltenbrunner

Scott Marlowe wrote:

On Mon, Mar 16, 2009 at 2:03 PM, Stefan Kaltenbrunner
ste...@kaltenbrunner.cc wrote:

So in my understanding LVM is safe on disks that have write cache disabled
or behave as one (like a controller with a battery backed cache).
For storage with write caches it seems to be unsafe, even if the filesystem
supports barriers and it has them enabled (which I don't think all have)
which is basically what all of linux was not too long ago.


I definitely didn't have this problem with SCSI drives directly
attached to a machine under pgsql on ext2 back in the day (way back,
like 5 to 10 years ago).  IDE / PATA drives, on the other hand,
definitely suffered with having write caches enabled.


I guess that's likely because most SCSI drives (at least back in the 
day) had write caches turned off by default (whereas IDE drives had 
them turned on).
The Linux kernel docs actually have some stuff on the barrier 
implementation (
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob_plain;f=Documentation/block/barrier.txt;hb=HEAD) 
which seems to explain some of the issues related to that.



Stefan



Re: [GENERAL] Maximum transaction rate

2009-03-15 Thread Marco Colombo
Joshua D. Drake wrote:
 
 I understand but disabling cache is not an option for anyone I know. So
 I need to know the other :)
 
 Joshua D. Drake
 

Come on, how many people/organizations do you know who really need 30+ MB/s
sustained write throughput in the disk subsystem but can't afford a
battery-backed controller at the same time?

Something must take care of writing the data in the disk cache to permanent
storage; write-thru caches, battery-backed controllers and write barriers
are all alternatives, choose the one you like most.

The problem here is fsync(). We know that not fsync()'ing gives you a big
performance boost, but that's not the point. I want to choose, and I want
a true fsync() when I ask for one. Because if the data don't even make it to
the disk cache, the whole point about write-thru, battery-backed and write
barriers is moot.

.TM.



Re: [GENERAL] Maximum transaction rate

2009-03-14 Thread Joshua D. Drake
On Sat, 2009-03-14 at 05:25 +0100, Marco Colombo wrote:
 Scott Marlowe wrote:

 Also see:
 http://lkml.org/lkml/2008/2/26/41
 but it seems to me that all this discussion is under the assumption that
 disks have write-back caches.
 "The alternative is to disable the disk write cache." says it all.

If this applies to RAID-based caches as well, then performance is going to
completely tank for users of Linux + PostgreSQL using LVM.

Joshua D. Drake

 
 .TM.
 
-- 
PostgreSQL - XMPP: jdr...@jabber.postgresql.org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997




Re: [GENERAL] Maximum transaction rate

2009-03-14 Thread Marco Colombo
Joshua D. Drake wrote:
 On Sat, 2009-03-14 at 05:25 +0100, Marco Colombo wrote:
 Scott Marlowe wrote:
 
 Also see:
 http://lkml.org/lkml/2008/2/26/41
 but it seems to me that all this discussion is under the assumption that
 disks have write-back caches.
 "The alternative is to disable the disk write cache." says it all.
 
 If this applies to raid based cache as well then performance is going to
 completely tank. For users of Linux + PostgreSQL using LVM.
 
 Joshua D. Drake

Yet that's not the point. The point is safety. I may have a lightly loaded
database, with a low write rate, but I still want it to be reliable. I just
want to know whether disabling the caches makes it reliable or not. People
on lkml seem to think it does. And it seems to me they may have a point.
fsync() is a flush operation on the block device, not a write barrier. LVM
doesn't pass write barriers down, but that doesn't mean it doesn't perform
a flush when requested to.

.TM.




Re: [GENERAL] Maximum transaction rate

2009-03-14 Thread Joshua D. Drake
On Sun, 2009-03-15 at 01:48 +0100, Marco Colombo wrote:
 Joshua D. Drake wrote:
  On Sat, 2009-03-14 at 05:25 +0100, Marco Colombo wrote:
  Scott Marlowe wrote:
  
  Also see:
  http://lkml.org/lkml/2008/2/26/41
  but it seems to me that all this discussion is under the assumption that
  disks have write-back caches.
  "The alternative is to disable the disk write cache." says it all.
  
  If this applies to raid based cache as well then performance is going to
  completely tank. For users of Linux + PostgreSQL using LVM.
  
  Joshua D. Drake
 
 Yet that's not the point. The point is safety. I may have a lightly loaded
 database, with low write rate, but still I want it to be reliable. I just
 want to know if disabling the caches makes it reliable or not.

I understand but disabling cache is not an option for anyone I know. So
I need to know the other :)

Joshua D. Drake

-- 
PostgreSQL - XMPP: jdr...@jabber.postgresql.org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997




Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Marco Colombo
Scott Marlowe wrote:
 On Fri, Mar 6, 2009 at 2:22 PM, Ben Chobot be...@silentmedia.com wrote:
 On Fri, 6 Mar 2009, Greg Smith wrote:

 On Fri, 6 Mar 2009, Tom Lane wrote:

  Otherwise you need to reconfigure your drive to not cache writes.
  I forget the incantation for that but it's in the PG list archives.
 There's a discussion of this in the docs now,
 http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html
 How does turning off write caching on the disk stop the problem with LVM? It
 still seems like you have to get the data out of the OS buffer, and if
 fsync() doesn't do that for you
 
 I think he was saying otherwise (if you're not using LVM and you still
 have this super high transaction rate) you'll need to turn off the
 drive's write caches.  I kinda wondered at it for a second too.
 

And I'm still wondering. The problem with LVM, AFAIK, is missing support
for write barriers. Once you disable the write-back cache on the disk,
you no longer need write barriers. So I'm missing something, what else
does LVM do to break fsync()?

It was my understanding that disabling disk caches was enough.

.TM.



Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Tom Lane
Marco Colombo pg...@esiway.net writes:
 And I'm still wondering. The problem with LVM, AFAIK, is missing support
 for write barriers. Once you disable the write-back cache on the disk,
 you no longer need write barriers. So I'm missing something, what else
 does LVM do to break fsync()?

I think you're imagining that the disk hardware is the only source of
write reordering, which isn't the case ... various layers in the kernel
can reorder operations before they get sent to the disk.

regards, tom lane



Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Marco Colombo
Tom Lane wrote:
 Marco Colombo pg...@esiway.net writes:
 And I'm still wondering. The problem with LVM, AFAIK, is missing support
 for write barriers. Once you disable the write-back cache on the disk,
 you no longer need write barriers. So I'm missing something, what else
 does LVM do to break fsync()?
 
 I think you're imagining that the disk hardware is the only source of
 write reordering, which isn't the case ... various layers in the kernel
 can reorder operations before they get sent to the disk.
 
   regards, tom lane

You mean some layer (LVM) is lying about the fsync()?

write(A);
fsync();
...
write(B);
fsync();
...
write(C);
fsync();

You mean that the process may be awakened after the first fsync() while
A is still somewhere in the OS buffers and not sent to disk yet, so it's
possible that B gets to the disk BEFORE A. And if the system crashes,
A never hits the platters while B (possibly) does. Is this what you
mean by write reordering?

But doesn't this break any application with transaction-like behavior,
such as sendmail? The problem being third parties: if sendmail declares
"OK, I saved the message" to the SMTP client (*after* an fsync()), it's
actually lying, because the message hasn't hit the platters yet.
The same applies to an IMAP/POP server, say. Well, it applies to anything
using fsync().
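
To make that concrete, here is a minimal sketch of the pattern in C (mine,
not sendmail's code; the path names are made up): the application may only
acknowledge after fsync() on both the file and its directory has returned,
which is exactly the guarantee a lying fsync() silently voids.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Save a message, sendmail-style: report success only after the data
 * AND the new directory entry are (supposedly) on stable storage.
 * If some layer below lies about fsync(), the "OK" is a lie too. */
static int save_message(const char *dir, const char *path,
                        const char *msg, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, msg, len) != (ssize_t) len || fsync(fd) != 0) {
        close(fd);
        return -1;              /* do NOT acknowledge */
    }
    close(fd);

    int dfd = open(dir, O_RDONLY);  /* make the directory entry durable too */
    if (dfd < 0)
        return -1;
    if (fsync(dfd) != 0) {
        close(dfd);
        return -1;
    }
    close(dfd);
    return 0;                   /* only now may we tell the client "250 OK" */
}

int main(void)
{
    const char *msg = "test message\n";
    if (save_message("/tmp", "/tmp/msg.eml", msg, strlen(msg)) == 0)
        puts("saved, as far as fsync() can tell");
    return 0;
}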

I mean, all this with disk caches in write-thru modes? It's the OS
lying, not the disks?

Wait, this breaks all journaled filesystems as well: a DM device is just
a block device to them, and if it's lying about synchronous writes the
whole purpose of the journal is defeated... I find it hard to
believe, I have to say.

.TM.



Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Tom Lane
Marco Colombo pg...@esiway.net writes:
 You mean some layer (LVM) is lying about the fsync()?

Got it in one.

regards, tom lane



Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Joshua D. Drake
On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote:
 Marco Colombo pg...@esiway.net writes:
  You mean some layer (LVM) is lying about the fsync()?
 
 Got it in one.
 

I wouldn't think this would be a problem with the proper battery backed
raid controller correct?

Joshua D. Drake


   regards, tom lane
 
-- 
PostgreSQL - XMPP: jdr...@jabber.postgresql.org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997




Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Ben Chobot

On Fri, 13 Mar 2009, Joshua D. Drake wrote:


On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote:

Marco Colombo pg...@esiway.net writes:

You mean some layer (LVM) is lying about the fsync()?


Got it in one.



I wouldn't think this would be a problem with the proper battery backed
raid controller correct?


It seems to me that all you get with a BBU-enabled card is the ability to
get bursts of writes out of the OS faster. So you still have the problem,
it's just less likely to be encountered.




Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Joshua D. Drake
On Fri, 2009-03-13 at 11:17 -0700, Ben Chobot wrote:
 On Fri, 13 Mar 2009, Joshua D. Drake wrote:
 
  On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote:
  Marco Colombo pg...@esiway.net writes:
  You mean some layer (LVM) is lying about the fsync()?
 
  Got it in one.
 
 
  I wouldn't think this would be a problem with the proper battery backed
  raid controller correct?
 
 It seems to me that all you get with a BBU-enabled card is the ability to
 get bursts of writes out of the OS faster. So you still have the problem,
 it's just less likely to be encountered.

A BBU controller is about more than that. It is also about data integrity:
the ability to have unexpected outages and still have the drives stay
consistent, because the controller remembers the state (if that is a
reasonable way to put it).

Joshua D. Drake


 
-- 
PostgreSQL - XMPP: jdr...@jabber.postgresql.org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997




Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Ben Chobot

On Fri, 13 Mar 2009, Joshua D. Drake wrote:


It seems to me that all you get with a BBU-enabled card is the ability to
get bursts of writes out of the OS faster. So you still have the problem,
it's just less likely to be encountered.


A BBU controller is about more than that. It is also supposed to be
about data integrity. The ability to have unexpected outages and have
the drives stay consistent because the controller remembers the state
(if that is a reasonable way to put it).


Of course. But if you can't reliably flush the OS buffers (because, say, 
you're using LVM so fsync() doesn't work), then you can't say what 
actually has made it to the safety of the raid card.




Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Joshua D. Drake
On Fri, 2009-03-13 at 11:41 -0700, Ben Chobot wrote:
 On Fri, 13 Mar 2009, Joshua D. Drake wrote:

 Of course. But if you can't reliably flush the OS buffers (because, say, 
 you're using LVM so fsync() doesn't work), then you can't say what 
 actually has made it to the safety of the raid card.

Good point. So the next question of course is, does EVMS do it right?

http://evms.sourceforge.net/

This is actually a pretty significant issue. 

Joshua D. Drake


 
-- 
PostgreSQL - XMPP: jdr...@jabber.postgresql.org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997




Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Joshua D. Drake
On Fri, 2009-03-13 at 11:41 -0700, Ben Chobot wrote:
 On Fri, 13 Mar 2009, Joshua D. Drake wrote:
 
  It seems to me that all you get with a BBU-enabled card is the ability to
  get bursts of writes out of the OS faster. So you still have the problem,
  it's just less likely to be encountered.
 
  A BBU controller is about more than that. It is also supposed to be
  about data integrity. The ability to have unexpected outages and have
  the drives stay consistent because the controller remembers the state
  (if that is a reasonable way to put it).
 
 Of course. But if you can't reliably flush the OS buffers (because, say, 
 you're using LVM so fsync() doesn't work), then you can't say what 
 actually has made it to the safety of the raid card.

Wait, actually a good BBU RAID controller will disable the cache on the
drives. So everything that is cached is already on the controller rather
than on the drives themselves.

Or am I missing something?

Joshua D. Drake

 
-- 
PostgreSQL - XMPP: jdr...@jabber.postgresql.org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997




Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Christophe


On Mar 13, 2009, at 11:59 AM, Joshua D. Drake wrote:
Wait, actually a good BBU RAID controller will disable the cache on the
drives. So everything that is cached is already on the controller rather
than on the drives themselves.

Or am I missing something?


Maybe I'm missing something, but a BBU controller moves the safe point
from the platters to the controller; it doesn't move it all the way up
into the OS.

So, if the software calls fsync(), but fsync() doesn't actually push the
data to the controller, you are still at risk... right?




Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Scott Marlowe
On Fri, Mar 13, 2009 at 1:09 PM, Christophe x...@thebuild.com wrote:

 On Mar 13, 2009, at 11:59 AM, Joshua D. Drake wrote:

 Wait, actually a good BBU RAID controller will disable the cache on the
 drives. So everything that is cached is already on the controller vs.
 the drives itself.

 Or am I missing something?

 Maybe I'm missing something, but a BBU controller moves the safe point
 from the platters to the controller, but it doesn't move it all the way into
 the OS.

 So, if the software calls fsync, but fsync doesn't actually push the data to
 the controller, you are still at risk... right?

Ding!



Re: [GENERAL] Maximum transaction rate

2009-03-13 Thread Marco Colombo
Scott Marlowe wrote:
 On Fri, Mar 13, 2009 at 1:09 PM, Christophe x...@thebuild.com wrote:
 So, if the software calls fsync, but fsync doesn't actually push the data to
 the controller, you are still at risk... right?
 
 Ding!
 

I've been doing some googling; now I'm not sure that not supporting barriers
implies not supporting (or lying about) blkdev_issue_flush(). It seems that
it's pretty common (and well-defined) for block devices to report
-EOPNOTSUPP for BIO_RW_BARRIER requests. The device mapper apparently falls
into this category.

See:
http://lkml.org/lkml/2007/5/25/71
this is an interesting discussion on barriers and flushing.

It seems to me that PostgreSQL needs both ordered and synchronous
writes, maybe at different times (not that EVERY write must be both ordered
and synchronous).

You can emulate ordered writes with single synchronous writes, although at a price.
You can't emulate synchronous writes with just barriers.

OPTIMAL: write-barrier-write-barrier-write-barrier-flush

SUBOPTIMAL: write-flush-write-flush-write-flush
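
As a userland illustration of the suboptimal pattern (a sketch of mine,
assuming only POSIX write()/fsync(); the file name is made up), ordering
is bought by waiting for each record to become durable before issuing
the next one:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Append A, B, C in a guaranteed order using only write()+fsync().
 * Every fsync() is a full flush, i.e. the write-flush-write-flush
 * pattern above: correct, but paying a flush where a barrier would do. */
int main(void)
{
    const char *recs[] = { "A\n", "B\n", "C\n" };
    int fd = open("/tmp/ordered.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (int i = 0; i < 3; i++) {
        if (write(fd, recs[i], strlen(recs[i])) < 0 || fsync(fd) != 0) {
            perror("write/fsync");
            return 1;
        }
        /* Only after fsync() returns is the next record issued, so B can
         * never reach the platters before A, unless a layer below lies
         * about the flush. */
    }
    close(fd);
    return 0;
}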


As I understand it, fsync() should always issue a real flush: it's unrelated
to the barriers issue.
There's no API to issue ordered writes (or barriers) at user level,
AFAIK. (Uhm... O_DIRECT, maybe implies that?)

FS code may internally issue barrier requests to the block device, for
its own purposes (e.g. journal updates), but there's no userland API for
that.

Yet there's no reference in the whole thread to DM not supporting flush
correctly... actually there are references to the opposite. DM devices
are defined as FLUSHABLE.

Also see:
http://lkml.org/lkml/2008/2/26/41
but it seems to me that all this discussion is under the assumption that
disks have write-back caches.
"The alternative is to disable the disk write cache." says it all.

.TM.



Re: [GENERAL] Maximum transaction rate

2009-03-06 Thread Tom Lane
Jack Orenstein jack.orenst...@hds.com writes:
 The transaction rates I'm getting seem way too high: 2800-2900 with
 one thread, 5000-7000 with ten threads. I'm guessing that writes
 aren't really reaching the disk. Can someone suggest how to figure out
 where, below postgres, someone is lying about writes reaching the
 disk?

AFAIK there are two trouble sources in recent Linux machines: LVM and
the disk drive itself.  LVM is apparently broken by design --- it simply
fails to pass fsync requests.  If you're using it you have to stop.
(Which sucks, because it's exactly the kind of thing DBAs tend to want.)
Otherwise you need to reconfigure your drive to not cache writes.
I forget the incantation for that but it's in the PG list archives.

regards, tom lane
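
A quick way to see for yourself whether something below Postgres is lying
(my sketch, not from the original posts; /tmp is just a placeholder, put the
test file on the volume you actually want to check): time a loop of tiny
write()+fsync() pairs. On a single 7200 rpm drive an honest fsync() tops out
at roughly 120 per second, so rates in the thousands mean some layer is
caching.

#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/* Crude fsync-rate probe: thousands of "commits" per second on a plain
 * 7200 rpm disk means some layer below is not really flushing. */
int main(void)
{
    int fd = open("/tmp/fsync_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const int n = 1000;
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < n; i++) {
        if (write(fd, "x", 1) != 1 || fsync(fd) != 0) {
            perror("write/fsync");
            return 1;
        }
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d fsyncs in %.2f s = %.0f per second\n", n, secs, n / secs);
    close(fd);
    return 0;
}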



Re: [GENERAL] Maximum transaction rate

2009-03-06 Thread Greg Smith

On Fri, 6 Mar 2009, Tom Lane wrote:


Otherwise you need to reconfigure your drive to not cache writes.
I forget the incantation for that but it's in the PG list archives.


There's a discussion of this in the docs now, 
http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html


hdparm -I lets you check if write caching is on, and hdparm -W lets you
toggle it off.  That's for ATA disks; for SCSI disks you can use sdparm
instead, but usually it's something you adjust more permanently in the
card configuration or BIOS for those.
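
If you'd rather check from a program, the kernel's idea of the drive's
write-cache setting can be read with the HDIO_GET_WCACHE ioctl from
linux/hdreg.h (a sketch of mine, assuming an ATA/IDE disk whose driver
implements that ioctl; some drivers don't, and hdparm remains the normal
tool):

#include <fcntl.h>
#include <linux/hdreg.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Print whether the kernel thinks write caching is enabled on a drive.
 * Needs root; the ioctl may fail on drivers that don't implement it
 * (fall back to hdparm/sdparm in that case). */
int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sda";
    int fd = open(dev, O_RDONLY | O_NONBLOCK);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    long wcache = 0;
    if (ioctl(fd, HDIO_GET_WCACHE, &wcache) != 0) {
        perror("HDIO_GET_WCACHE");
        return 1;
    }
    printf("%s: write cache %s\n", dev, wcache ? "enabled" : "disabled");
    close(fd);
    return 0;
}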


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: [GENERAL] Maximum transaction rate

2009-03-06 Thread Ben Chobot

On Fri, 6 Mar 2009, Greg Smith wrote:


On Fri, 6 Mar 2009, Tom Lane wrote:


 Otherwise you need to reconfigure your drive to not cache writes.
 I forget the incantation for that but it's in the PG list archives.


There's a discussion of this in the docs now, 
http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html


How does turning off write caching on the disk stop the problem with LVM? 
It still seems like you have to get the data out of the OS buffer, and if 
fsync() doesn't do that for you




Re: [GENERAL] Maximum transaction rate

2009-03-06 Thread Scott Marlowe
On Fri, Mar 6, 2009 at 2:22 PM, Ben Chobot be...@silentmedia.com wrote:
 On Fri, 6 Mar 2009, Greg Smith wrote:

 On Fri, 6 Mar 2009, Tom Lane wrote:

  Otherwise you need to reconfigure your drive to not cache writes.
  I forget the incantation for that but it's in the PG list archives.

 There's a discussion of this in the docs now,
 http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html

 How does turning off write caching on the disk stop the problem with LVM? It
 still seems like you have to get the data out of the OS buffer, and if
 fsync() doesn't do that for you

I think he was saying otherwise (if you're not using LVM and you still
have this super high transaction rate) you'll need to turn off the
drive's write caches.  I kinda wondered at it for a second too.



Re: [GENERAL] Maximum transaction rate

2009-03-06 Thread Greg Smith

On Fri, 6 Mar 2009, Ben Chobot wrote:


How does turning off write caching on the disk stop the problem with LVM?


It doesn't.  Linux LVM is awful and broken; I was just suggesting more 
details on what you still need to check even when it's not involved.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD
