Matt Clark wrote:
I'm thinking along the lines of an FS that's aware of PG's strategies and
requirements and therefore optimised to make those activities as efficient
as possible - possibly even being aware of PG's disk layout and treating
files differently on that basis.

As someone else noted, this doesn't belong in the filesystem; it belongs in the kernel's block I/O layer and buffer cache. But I agree that an API by which we can tell the kernel what kind of I/O behavior to expect would be good. The kernel needs to provide good behavior for a wide range of applications, whereas the DBMS has a lot of domain-specific information it could take advantage of. In theory, being able to pass that domain-specific information on to the kernel would mean we could get better performance without needing to reimplement large chunks of functionality that really ought to be done by the kernel anyway (as implementing raw I/O would require, for example). On the other hand, it would probably mean adding a fair bit of OS-specific hackery, which we've largely managed to avoid in the past.

The closest API to what you're describing that I'm aware of is posix_fadvise(). While that is technically speaking a POSIX standard, it is not widely implemented (I know Linux 2.6 implements it; based on some quick googling, it looks like AIX does too). Using posix_fadvise() has been discussed in the past, so you might want to search the archives. We could use POSIX_FADV_SEQUENTIAL to request more aggressive readahead on a file that we know we're about to scan sequentially. We might be able to use POSIX_FADV_NOREUSE on the WAL. We might be able to get away with specifying POSIX_FADV_RANDOM for indexes all of the time, or at least most of the time. One open question is how this would interact with concurrent access (AFAICS there is no way to fetch the "current advice" on an fd...).
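
To make the idea concrete, here is a rough sketch of how such hints might be issued. It is only an illustration, not a proposed patch: the io_hint enum and the advice_for() and hint_file() helpers are hypothetical names invented for this example, and the mapping from access pattern to advice is just the guess described above. It assumes a platform that actually provides posix_fadvise() (e.g. Linux 2.6).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask for the posix_fadvise() declaration */
#endif
#include <fcntl.h>

/* Hypothetical categories of access pattern; not existing PostgreSQL code. */
typedef enum { HINT_SEQSCAN, HINT_INDEX, HINT_WAL } io_hint;

/* Map an expected access pattern to a posix_fadvise() advice value. */
int advice_for(io_hint hint)
{
    switch (hint) {
        case HINT_SEQSCAN: return POSIX_FADV_SEQUENTIAL; /* more readahead */
        case HINT_INDEX:   return POSIX_FADV_RANDOM;     /* little readahead */
        case HINT_WAL:     return POSIX_FADV_NOREUSE;    /* data used once */
    }
    return POSIX_FADV_NORMAL;
}

/*
 * Apply the hint to the whole file (offset 0, len 0 means "to end of file").
 * posix_fadvise() returns 0 on success or an error number on failure;
 * it does not set errno.
 */
int hint_file(int fd, io_hint hint)
{
    return posix_fadvise(fd, 0, 0, advice_for(hint));
}
```

A caller would do something like hint_file(fd, HINT_SEQSCAN) just before starting a sequential scan. Since the advice is only a hint, a nonzero return could simply be ignored, or logged.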

Also, I would imagine Win32 provides some means to inform the kernel about your expected I/O pattern, but I haven't checked. Does anyone know of any other relevant APIs?

