Hello, long time no see.

I'm sorry to interrupt your discussion. I'm afraid the code is getting
more complicated as long as it continues to use fsync(). Though I
don't mean to say the current approach is wrong, could anyone
re-evaluate the O_SYNC approach that commercial databases use, and
explain whether and why PostgreSQL's fsync() approach is better than
theirs?

This January, I got a good result with O_SYNC, which I haven't
reported here yet. I'll show it briefly. Please forgive the
abruptness of this email; I don't have much time.
# Personally, I'd like to work in the community, if I'm allowed.
And sorry again: last year I reported that O_SYNC resulted in very bad
performance, but that was wrong. The PC server I had borrowed was
configured so that all the disks formed one RAID5 device. The disks
for data and WAL (/dev/sdd and /dev/sde) therefore came from the same
RAID5 device, resulting in I/O contention.

All I modified was md.c: I added O_SYNC to the open flags in mdopen()
and _mdfd_openseg() when am_bgwriter is true. I didn't want the
backends to use O_SYNC, because mdextend() does not need to transfer
data to disk.

My evaluation environment was:

CPU: Intel Xeon 3.2GHz * 2 (HT on)
Memory: 4GB
Disk: Ultra320 SCSI (probably configured with write-back cache)
OS: RHEL3.0 Update 6
Kernel: 2.4.21-37.ELsmp
PostgreSQL: 8.2.1

The relevant settings of PostgreSQL are:

shared_buffers = 2GB
wal_buffers = 1MB
wal_sync_method = open_sync
checkpoint_* and bgwriter_* parameters are left at their defaults.

I used pgbench, with the data of scaling factor 50.

[without O_SYNC, original behavior]
- pgbench -c1 -t16000
  best response: 1ms
  worst response: 6314ms
  10th worst response: 427ms
  tps: 318
- pgbench -c32 -t500
  best response: 1ms
  worst response: 8690ms
  10th worst response: 8668ms
  tps: 330

[with O_SYNC]
- pgbench -c1 -t16000
  best response: 1ms
  worst response: 350ms
  10th worst response: 91ms
  tps: 427
- pgbench -c32 -t500
  best response: 1ms
  worst response: 496ms
  10th worst response: 435ms
  tps: 1117

If the write-back cache were disabled, the difference would likely be
even larger. The Windows version showed similar improvements.

However, this approach has two big problems.

(1) Bulk updates slow down

Updates of large amounts of data get much slower, because the bgwriter
seeks and writes dirty buffers synchronously, page by page. For example:

- COPY of accounts (5m records) and CHECKPOINT command after COPY
  without O_SYNC: 100sec
  with O_SYNC: 1046sec
- UPDATE of all records of accounts
  without O_SYNC: 139sec
  with O_SYNC: 639sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers
  without O_SYNC: 24sec
  with O_SYNC: 126sec

To mitigate this problem, I sorted the dirty buffers by relfilenode
and block number, and wrote multiple pages together when they were
adjacent both in memory and on disk. The results were:

- COPY of accounts (5m records) and CHECKPOINT command after COPY
- UPDATE of all records of accounts
- CHECKPOINT command for flushing 1.6GB of dirty buffers

Still bad...

(2) Can't utilize tablespaces

Though I didn't evaluate it, update activity would be much less
efficient with O_SYNC than with fsync() when multiple tablespaces are
used, because there is only one bgwriter.

Can anyone solve these problems?
One of my ideas is to use scattered I/O. I hear that readv()/writev()
have been able to do real scattered I/O since kernel 2.6 (RHEL4.0);
with earlier kernels, readv()/writev() just performed the I/Os
sequentially. Windows has provided reliable scattered I/O for years.

Another idea is to use async I/O, possibly combined with a
multiple-bgwriter approach on platforms where async I/O is not
available. How about the chance Josh-san has brought?
