> > > What you really want is Solaris's free-behind, where it detects
> > > if a scan is exceeding a certain percentage of the OS cache and
> > > moves the pages to the _front_ of the to-be-reused list.  I am
> > > not sure what other OS's support this, but we need this in our
> > > own buffer manager code as well.
> > >
> > > Our TODO already has:
> > >
> > > 	* Add free-behind capability for large sequential scans (Bruce)
> > >
> > > Basically, I think we need free-behind rather than O_DIRECT.
> >
> > I suppose, but you've already polluted the cache by the time the
> > above-mentioned mechanism kicks in and takes effect.  Given that
> > the planner has an idea of how much data it's going to read in
> > order to complete the query, it seems easier/better to mark the fd
> > O_DIRECT.  *shrug*
>
> _That_ is an excellent point.  However, do we know at the time we
> open the file descriptor if we will be doing this?
Doesn't matter, it's an option to fcntl().

> What about cache coherency problems with other backends not opening
> with O_DIRECT?

That's a problem for the kernel VM, if you mean cache coherency in
the VM.  If you mean inside of the backend, that could be a stickier
issue, I think.  I don't know enough of the internals yet to know if
this is a problem or not, but you're right, it's certainly something
to consider.

Is the cache a write-behind cache or a read-through cache?  If it's
a read-through cache, which I think it is, then the backend would
have to dirty all cache entries pertaining to the relations being
opened with O_DIRECT.  The use case for that being:

1) a transaction begins
2) a few rows out of the huge table are read
3) a huge query is performed that triggers the use of O_DIRECT
4) the rows selected in step 2 are updated (this step should poison
   or update the cache, actually, and act as a write-through cache
   if the data is in the cache)
5) the same few rows are read in again
6) the transaction is committed

Provided the cache is poisoned or updated in step 4, I can't see how
or where this would be a problem.  Please enlighten me if there's a
different case that would need to be taken into account.

I can't imagine ever wanting to write out data using O_DIRECT and
think of it as a read-only optimization in an attempt to minimize
the turnover in the OS's cache.  From fcntl(2):

     O_DIRECT    Minimize or eliminate the cache effects of reading
                 and writing.  The system will attempt to avoid
                 caching the data you read or write.  If it cannot
                 avoid caching the data, it will minimize the impact
                 the data has on the cache.  Use of this flag can
                 drastically reduce performance if not used with
                 care.

> And finally, how do we deal with the fact that writes to O_DIRECT
> files will wait until the data hits the disk because there is no
> kernel buffer cache?

Well, two things:

1) O_DIRECT should never be used on writes...
   I can't think of a case where you'd want write caching off: even
   when COPY'ing data and restoring a DB, it just doesn't make sense
   to use O_DIRECT.  The write buffer is emptied as soon as the pages
   hit the disk unless something is reading those bits, but I'd
   imagine the write buffer is used to make sure that as much writing
   is done to the platter in a single write by the OS as possible;
   circumventing that would be insane (though possibly useful for
   embedded devices with low RAM, solid-state drives, or some
   super-nice EMC Fibre Channel storage device that basically has its
   own huge disk cache).

2) Last I checked, PostgreSQL isn't a threaded app and doesn't use
   non-blocking IO.  The backend would block until the call returns,
   so where's the problem?  :)

If anything, O_DIRECT would shake out any bugs in PostgreSQL's
caching code, if there are any.

-sc

-- 
Sean Chittenden

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings