(2013/12/05 23:42), Greg Stark wrote:
On Thu, Dec 5, 2013 at 8:35 AM, KONDO Mitsumasa
<kondo.mitsum...@lab.ntt.co.jp> wrote:
Yes. And using direct I/O efficiently is more difficult than using
buffered I/O.
If we switch PostgreSQL's write() calls to direct I/O, it will produce
the worst kind of random I/O.

Using DirectIO presumes you're using libaio or threads to implement
prefetching and asynchronous I/O scheduling.

I think in the long term there are only two ways to go here. Either a)
we use DirectIO and implement an I/O scheduler in Postgres or b) We
use mmap and use new system calls to give the kernel all the
information Postgres has available to it to control the I/O scheduler.
I agree with part of method (b). As others have said, I don't think the mmap API is meant for controlling I/O. I think posix_fadvise(), sync_file_range() and fallocate() are an easier way to get a better I/O scheduler in Postgres. These system calls do not cause data corruption, and we can keep using the existing implementation. They affect only performance.

My survey of posix_fadvise() and sync_file_range() is below. The rules are simple.
#Most of this explanation comes from the Linux man pages :-)

* Optimize readahead in the OS [ posix_fadvise() ]
  These options are mainly for read performance.

  - POSIX_FADV_SEQUENTIAL flag
    -> The OS readahead window is enlarged (Linux doubles the default
       size).
  - POSIX_FADV_RANDOM flag
    -> The OS does not use readahead. This keeps pages that will never be
       read out of the file cache, so the cache is used more efficiently.
  - POSIX_FADV_NORMAL
    -> The OS tunes readahead dynamically for each situation. If we
       cannot decide on a disk access strategy, we can select this
       option. It should work well in most cases.
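
The three readahead hints above could be applied roughly as in the sketch below. The scan_pattern enum and the advice_for() and hint_scan() helpers are hypothetical names used only for illustration; they are not PostgreSQL code. The only real API used is posix_fadvise().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>

/* Hypothetical access-pattern hint; not a PostgreSQL type. */
typedef enum { SCAN_UNKNOWN, SCAN_SEQUENTIAL, SCAN_RANDOM } scan_pattern;

/* Map the expected access pattern to a posix_fadvise() advice value. */
static int advice_for(scan_pattern p)
{
    switch (p) {
    case SCAN_SEQUENTIAL: return POSIX_FADV_SEQUENTIAL;
    case SCAN_RANDOM:     return POSIX_FADV_RANDOM;
    default:              return POSIX_FADV_NORMAL;
    }
}

/*
 * Apply the hint to the whole file (offset 0, len 0 means "to end of file").
 * Returns 0 on success or an error number; posix_fadvise() returns the
 * error directly and does not set errno.
 */
static int hint_scan(int fd, scan_pattern p)
{
    assert(fd >= 0);
    return posix_fadvise(fd, 0, 0, advice_for(p));
}
```

A caller would issue hint_scan(fd, SCAN_SEQUENTIAL) before a sequential scan and hint_scan(fd, SCAN_RANDOM) before index-driven random reads. The advice is only a hint, so the kernel is free to ignore it.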

* Control dirty and clean buffers in the OS [ posix_fadvise() and sync_file_range() ]
  These options control write and read performance through the OS file cache.

  - POSIX_FADV_DONTNEED
   -> Drop the file cache. Clean pages are simply dropped from the OS
      file cache. Dirty pages have to be written to disk before they can
      be dropped; according to the Linux man page, pages not yet written
      out are unaffected, so call fsync(), fdatasync() or
      sync_file_range() first if they must be released.
  - sync_file_range()
   -> If you want to write dirty buffers to disk but keep them in the OS
      file cache, you can use this system call. It also lets you control
      how much is written at once. It is Linux-specific.
  - POSIX_FADV_NOREUSE
   -> If you think the cached data will be accessed only once, you can
      set this option. Note that, according to the Linux man page, this
      flag has been a no-op since kernel 2.6.18.
  - POSIX_FADV_WILLNEED
   -> If you think the data will be needed soon, you can set this
      option. The OS starts a non-blocking read of the range into the
      file cache.
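
As a rough sketch of the write-side calls above: start_writeback() and flush_and_drop() are hypothetical helper names, not existing PostgreSQL functions. The sketch assumes Linux, since sync_file_range() is Linux-specific. Note that sync_file_range() does not flush file metadata and gives no durability guarantee, so it does not replace fsync().

```c
#define _GNU_SOURCE             /* needed for sync_file_range() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Start write-back of dirty pages in the range without waiting for it to
 * finish. The pages stay in the OS file cache. nbytes == 0 means "to end
 * of file". Returns 0 on success, -1 with errno set on failure.
 */
static int start_writeback(int fd, off_t offset, off_t nbytes)
{
    assert(fd >= 0);
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/*
 * Write the range out, wait for completion, then ask the kernel to drop
 * the now-clean pages from the file cache.
 */
static int flush_and_drop(int fd, off_t offset, off_t nbytes)
{
    int rc;

    assert(fd >= 0);
    if (sync_file_range(fd, offset, nbytes,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        return -1;

    /* posix_fadvise() returns the error number instead of setting errno. */
    rc = posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
    if (rc != 0) {
        errno = rc;
        return -1;
    }
    return 0;
}
```

Calling start_writeback() on small ranges as data is written is one way to limit how much dirty data accumulates, which is the "control the amount of write size" point above.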


That's all.

The OS kernel cannot perfectly predict the I/O pattern of each piece of middleware, so it is tuned with general heuristic algorithms. I think that is the right approach. However, PostgreSQL can predict its I/O pattern in parts of the planner, executor and checkpointer, so we had better set the optimal posix_fadvise() flag or call sync_file_range() before or after the ordinary I/O system calls. I think this would be a good method of I/O control and scheduling that does not require a risky new implementation.
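
For example, when the list of blocks to be read is known in advance, a hint could be issued before the ordinary read calls. This is only a sketch: HINT_BLOCK_SIZE, block_offset() and prefetch_blocks() are hypothetical names, and the block size is assumed to be 8192 bytes, which is PostgreSQL's default BLCKSZ.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

#define HINT_BLOCK_SIZE 8192    /* assumed block size for this sketch */

/* Byte offset of a block within the file. */
static off_t block_offset(unsigned int blkno)
{
    return (off_t) blkno * HINT_BLOCK_SIZE;
}

/*
 * Ask the kernel to start reading each listed block into the file cache.
 * The call does not block on the I/O. Returns the number of hints the
 * kernel accepted.
 */
static size_t prefetch_blocks(int fd, const unsigned int *blknos, size_t n)
{
    size_t accepted = 0;
    size_t i;

    assert(fd >= 0);
    for (i = 0; i < n; i++) {
        if (posix_fadvise(fd, block_offset(blknos[i]), HINT_BLOCK_SIZE,
                          POSIX_FADV_WILLNEED) == 0)
            accepted++;
    }
    return accepted;
}
```

The later read() or pread() calls for those blocks are unchanged; the hint only gives the kernel a chance to have the data in the file cache by then.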

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

