posix_fadvise(2) may be a candidate. Read/write barriers are another one, as
well as syncing a bunch of data in different files with a single call
(so that the OS can determine the best write order). I can also imagine
some interaction with the FS journalling system (to avoid duplicate
effort).


There is also the fact that syncing after every transaction could be changed to syncing every N transactions (N fixed, or depending on the amount of data written by the transactions), which would be more efficient than the current behaviour with a sleep. HOWEVER, suppressing the sleep() would lead to postgres returning from the COMMIT while the data is in fact not synced, which rings a huge alarm bell somewhere.
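
To make the "sync every N transactions" idea concrete, here is a minimal sketch in C. Everything in it (sync_batch, batch_commit, the two thresholds) is a hypothetical name made up for illustration and is not taken from the PostgreSQL source. It only shows the counting policy: a commit is reported as durable only by the call that actually ran fsync(), so nothing is acknowledged before it is on disk.

```c
#define _POSIX_C_SOURCE 200112L
#include <stddef.h>
#include <unistd.h>

/* Hypothetical batching state; none of these names exist in PostgreSQL. */
typedef struct {
    int    fd;             /* descriptor of the log file being synced    */
    int    pending;        /* commits written but not yet synced         */
    size_t pending_bytes;  /* bytes written since the last sync          */
    int    max_commits;    /* the "N" above: sync after this many        */
    size_t max_bytes;      /* ...or once this much data is waiting       */
} sync_batch;

/*
 * Record one commit that wrote `bytes` bytes to the log.
 * Returns the number of commits made durable by this call (0 if the
 * sync was deferred), or -1 if fsync() failed. Callers must not report
 * a commit as done until a call has returned a positive count covering it.
 */
int batch_commit(sync_batch *b, size_t bytes)
{
    b->pending++;
    b->pending_bytes += bytes;

    if (b->pending < b->max_commits && b->pending_bytes < b->max_bytes)
        return 0;                 /* thresholds not reached: defer the sync */

    if (fsync(b->fd) != 0)
        return -1;                /* pending work stays queued on failure */

    int done = b->pending;
    b->pending = 0;
    b->pending_bytes = 0;
    return done;
}
```

With max_commits = 3, the first two calls return 0 and the third returns 3. A real implementation would also need a timeout so that a lone transaction is not left waiting forever for the batch to fill, which is presumably what the sleep is standing in for.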


What about read order?
This could be very useful for SELECT queries involving indexes, which on a non-clustered table lead to random seeks in the table.
There's fadvise to tell the OS to read ahead on a seq scan (I think the OS detects it anyway), but what if there were a system call telling the OS "in the next few seconds I'm going to read these chunks of data from this file (here is a list of offsets and lengths), could you put them in your cache in the most efficient order without seeking too much, so that when I read() them in random order, they will already be in the cache?" This would be an asynchronous call which would return immediately, just queuing up the requests somewhere in the kernel, and maybe sending a signal to the application when a certain percentage of the data has been cached.
PG could take advantage of this without many code changes, simply by putting a FIFO between the index scan and the tuple fetches, to wait long enough for the OS to have enough reads queued to cluster them efficiently.
On very large tables this would maybe not gain much, but on tables which are explicitly clustered, or naturally clustered (like accessing an index on a serial primary key in order), it could be interesting.
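
Part of this can be approximated with what exists already: posix_fadvise() with POSIX_FADV_WILLNEED, called once per chunk. The sketch below is only an illustration; the chunk type and the prefetch_chunks() function are hypothetical names, not anything in PG. It sorts the requested chunks by offset and hands each one to the kernel as a hint. It does not give the completion signal described above, and the kernel is free to ignore the hints.

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical description of one region we expect to read soon. */
typedef struct {
    off_t offset;
    off_t length;
} chunk;

/* qsort comparator: order chunks by ascending file offset. */
static int cmp_chunk(const void *a, const void *b)
{
    off_t x = ((const chunk *)a)->offset;
    off_t y = ((const chunk *)b)->offset;
    return (x > y) - (x < y);
}

/*
 * Sort the chunks in place by offset, then advise the kernel that each
 * one will be needed. Returns how many hints the kernel accepted.
 * posix_fadvise() returns 0 on success and an error number otherwise.
 */
size_t prefetch_chunks(int fd, chunk *chunks, size_t n)
{
    size_t accepted = 0;

    qsort(chunks, n, sizeof *chunks, cmp_chunk);

    for (size_t i = 0; i < n; i++) {
        if (posix_fadvise(fd, chunks[i].offset, chunks[i].length,
                          POSIX_FADV_WILLNEED) == 0)
            accepted++;
    }
    return accepted;
}
```

The index scan would fill an array of chunks (the FIFO), call prefetch_chunks() on it, and only then start the read()s in whatever order it likes. Sorting in user space is a guess at what helps; the I/O scheduler may well reorder the requests on its own.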


        Just a thought.
