> > > What you really want is Solaris's free-behind, where it detects
> > > if a scan is exceeding a certain percentage of the OS cache and
> > > moves the pages to the _front_ of the to-be-reused list.  I am
> > > not sure what other OS's support this, but we need this on our
> > > own buffer manager code as well.
> > > 
> > > Our TODO already has:
> > > 
> > >   * Add free-behind capability for large sequential scans (Bruce)
> > > 
> > > Basically, I think we need free-behind rather than O_DIRECT.
> > 
> > I suppose, but you've already polluted the cache by the time the
> > above mentioned mechanism kicks in and takes effect.  Given that
> > the planner has an idea of how much data it's going to read in in
> > order to complete the query, seems easier/better to mark the fd
> > O_DIRECT.  *shrug*
> _That_ is an excellent point.  However, do we know at the time we
> open the file descriptor if we will be doing this?

Doesn't matter, it's an option to fcntl().

> What about cache coherency problems with other backends not opening
> with O_DIRECT?

That's a problem for the kernel VM, if you mean cache coherency in the
VM.  If you mean inside of the backend, that could be a stickier
issue, I think.  I don't know enough of the internals yet to know if
this is a problem or not, but you're right, it's certainly something
to consider.  Is the cache a write-behind cache or is it a
read-through cache?  If it's a read-through cache, which I think it
is, then the backend would have to dirty all cache entries pertaining
to the relations being opened with O_DIRECT.  The use case for that
would be something like:

1) a transaction begins
2) a few rows out of the huge table are read
3) a huge query is performed that triggers the use of O_DIRECT
4) the rows selected in step 2 are updated (this step should poison or
   update the cache, actually, and act as a write through cache if the
   data is in the cache)
5) the same few rows are read in again
6) transaction is committed

Provided the cache is poisoned or updated in step 4, I can't see how
or where this would be a problem.  Please enlighten me if there's a
different case that would need to be taken into account.  I can't
imagine ever wanting to write out data using O_DIRECT; I think of it
as a read-only optimization, an attempt to minimize the turnover in
the OS's cache.  From fcntl(2):

     O_DIRECT     Minimize or eliminate the cache effects of reading and writ-
                  ing.  The system will attempt to avoid caching the data you
                  read or write.  If it cannot avoid caching the data, it will
                  minimize the impact the data has on the cache.  Use of this
                  flag can drastically reduce performance if not used with
                  care.

> And finally, how do we deal with the fact that writes to O_DIRECT
> files will wait until the data hits the disk because there is no
> kernel buffer cache?

Well, two things.

1) O_DIRECT should never be used on writes... I can't think of a case
   where you'd want it on; even when COPY'ing data and restoring a
   DB, it just doesn't make sense to use it.  The write buffer is
   emptied as soon as the pages hit the disk unless something is
   reading those bits, but I'd imagine the write buffer would be used
   to make sure that as much writing is done to the platter in a
   single write by the OS as possible, circumventing that would be
   insane (though useful possibly for embedded devices with low RAM,
   solid state drives, or some super nice EMC fiber channel storage
   device that basically has its own huge disk cache).

2) Last I checked PostgreSQL wasn't a threaded app and doesn't use
   non-blocking IO.  The backend would block until the call returns,
   where's the problem?  :)

If anything O_DIRECT would shake out any bugs in PostgreSQL's caching
code, if there are any.  -sc

Sean Chittenden
