On Fri, 2005-06-03 at 10:37 +1000, Neil Conway wrote:
> On Thu, 2005-06-02 at 11:49 -0700, Mary Edie Meredith wrote:
> > My understanding is that O_DIRECT means "direct" as in "no buffering by
> > the OS" which implies that if you write from your buffer, the write is
> > not going to return unless the OS thinks the write is completed
>
> Right, I think that's definitely the case. The question is whether a
> write() under O_DIRECT will also flush the disk's write cache -- i.e.
> when the write() completes, we need it to be durable over a spontaneous
> power loss. fsync() or O_SYNC should provide this (modulo braindamaged
> IDE hardware), but I wouldn't be surprised if O_DIRECT by itself will
> not (otherwise you would hurt the performance of applications using
> O_DIRECT that don't need these durability guarantees).
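For concreteness, the two write paths being compared here can be sketched roughly as below. This is only a minimal illustration assuming Linux with glibc; the function names, the 4096-byte alignment, and the file mode are made up for the example and are not taken from PostgreSQL or from the Capabilities Document.

```c
#define _GNU_SOURCE           /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096            /* assumed alignment; real code should ask the device */

/* Path 1: ordinary buffered write, then fsync() to ask the kernel to
 * flush the file's dirty pages to the device. */
static int write_then_fsync(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    int ok = write(fd, buf, len) == (ssize_t)len && fsync(fd) == 0;
    close(fd);
    return ok ? 0 : -1;
}

/* Path 2: O_DIRECT write that bypasses the OS page cache. The buffer and
 * the length must be suitably aligned, so the data is copied into an
 * aligned scratch buffer first. No fsync() is issued on this path. */
static int write_direct(const char *path, const void *src, size_t len)
{
    assert(path != NULL && src != NULL);
    if (len == 0 || len % BLOCK != 0) {
        errno = EINVAL;       /* reject lengths that are not block multiples */
        return -1;
    }
    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK, len) != 0)
        return -1;
    memcpy(buf, src, len);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
    if (fd < 0) {             /* some filesystems do not support O_DIRECT */
        free(buf);
        return -1;
    }
    int ok = write(fd, buf, len) == (ssize_t)len;
    close(fd);
    free(buf);
    return ok ? 0 : -1;
}
```

In both functions a return of 0 only says that the kernel reported the I/O as complete. Neither sequence, as written, tells you whether the drive's own write cache has been flushed, which is the open question in this thread.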

My understanding is that for Linux, with respect to "guaranteed writes",
a write with the fd opened as O_DIRECT behaves the _same_ as a
write/fsync on an fd opened without O_DIRECT, i.e. whether the write
completes all the way to the disk itself depends on when the particular
device responds to those equivalent sequences.

Quoting from the Capabilities Document:

  "'Guarantee a write completion' means the operating system has issued
  a write to the I/O subsystem, and the device has returned an
  affirmative response. Once an affirmative response is sent, recovery
  from power down without data loss is the responsibility of the I/O
  subsystem."

Don't most disk drives have a battery backup so that they can flush
their cache if power is lost? Ditto for disk arrays with fancier caches
and write-back turned on (not advised for the paranoid).

Looking at this from another angle, is there really any way that you can
say a write is truly guaranteed in the event of a failure? I think in
the end, to be safe, you cannot. That's why (and I'm not telling you
anything new) there is no substitute for backups and log archiving for
databases. Databases must be able to recognize the last _good_
transaction logged and roll forward to that from the backup (including
detecting partial writes to the log). I'm sure the PostgreSQL community
has worked hard to do the equivalent of that within the PostgreSQL
architecture.

>
> > Bottom line: if you do not implement direct/async IO so that you
> > optimize caching of hot database objects and minimize memory utilization
> > of objects used once, you are probably leaving performance on the table
> > for datafiles.
>
> Absolutely -- patches are welcome :)

How about testing patches (--:

> I agree async IO + O_DIRECT in some
> form would be interesting, but the changes required are far from trivial
> -- my guess is there are lower hanging fruit.

Since the log has to be sequential, I think you are on the right track!

Believe me, I didn't mean to imply that it is trivial to implement. For
those databases that have async/direct, the functionality appeared over
a span of several major versions. I just thought I detected an opinion
that it would not help. Sorry for the misunderstanding. I absolutely
don't mean to sound critical. At OSDL we have the greatest respect for
the PostgreSQL community.

>
> -Neil

--
Mary Edie Meredith
[EMAIL PROTECTED]
503-906-1942
Data Center Linux Initiative Manager
Open Source Development Labs