On 2016-02-18 09:51:20 +0100, Fabien COELHO wrote:
> I've looked at these patches, especially the whole bunch of explanations and
> comments which is a good source for understanding what is going on in the
> WAL writer, a part of pg I'm not familiar with.
> When reading the patch 0002 explanations, I had the following comments:
> AFAICS, there are several levels of actions when writing things in pg:
>  0: the thing is written in some internal buffer
>  1: the buffer is advised to be passed to the OS (hint bits?)

Hint bits aren't related to OS writes. They're about information like
'this transaction committed' or 'all tuples on this page are visible'.

>  2: the buffer is actually passed to the OS (write, flush)
>  3: the OS is advised to send the written data to the io subsystem
>     (sync_file_range with SYNC_FILE_RANGE_WRITE)
>  4: the OS is required to send the written data to the disk
>     (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)
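
For concreteness, here is a rough sketch of levels 2 to 4 from the list
above as plain system calls. This is not PostgreSQL code: the helper name
write_hint_and_sync() and its signature are made up for illustration. It
assumes Linux for the sync_file_range() hint and falls back to skipping
that step elsewhere. The hint is treated as purely advisory, so a failure
there is ignored.

```c
#define _GNU_SOURCE             /* exposes sync_file_range() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): push one buffer through
 * levels 2-4 at the given file offset.
 * Returns 0 on success, -1 if the write or the final fsync fails.
 */
static int
write_hint_and_sync(int fd, const char *buf, size_t len, off_t offset)
{
    assert(buf != NULL && len > 0);

    /* Level 2: hand the data to the OS; it now sits in the page cache. */
    if (pwrite(fd, buf, len, offset) != (ssize_t) len)
        return -1;

#ifdef SYNC_FILE_RANGE_WRITE
    /*
     * Level 3: ask the kernel to start writeback for this range.
     * This is only a hint and gives no durability guarantee, so the
     * return value is deliberately ignored in this sketch.
     */
    (void) sync_file_range(fd, offset, (off_t) len, SYNC_FILE_RANGE_WRITE);
#endif

    /* Level 4: block until the OS reports the data as durable. */
    if (fsync(fd) != 0)
        return -1;

    return 0;
}
```

A caller would open a file read/write and call, for example,
write_hint_and_sync(fd, "hello", 5, 0). Only the final fsync() is what
durability rests on here.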

We can't easily rely on sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER) -
the guarantees it gives aren't well defined, and actually changed across

0002 is about something different: the WAL writer, which writes WAL to
disk so that individual backends don't have to. It does so in the
background every wal_writer_delay, or whenever a transaction
asynchronously commits.  The reason this interacts with checkpoint
flushing is that, when we flush writes at a regular pace, the writes by
the checkpointer happen in between the very frequent writes/fdatasync()s
by the WAL writer. That means the disk's caches are flushed on every
fdatasync() - which causes considerable slowdowns.  On a decent SSD the
WAL writer, before this patch, often did 500-1000 fdatasync()s a second;
at that rate the checkpointer's regular sync_file_range calls slowed
things down too much.
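
To make the effect of flush frequency easier to play with, here is a
small sketch that appends records to a file and counts how many sync
calls a given batching factor produces. It is only an illustration, not
the logic of patch 0002: append_with_batched_sync() and its flush_every
parameter are hypothetical names, and it uses fsync() for portability
where the WAL writer discussed here issues fdatasync().

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): append n_records small
 * records to fd, syncing after every flush_every records and once more
 * at the end if unsynced records remain.
 * Returns the number of fsync() calls issued, or -1 on error.
 */
static int
append_with_batched_sync(int fd, int n_records, int flush_every)
{
    static const char record[] = "commit-record\n";
    const ssize_t record_len = (ssize_t) (sizeof(record) - 1);
    int pending = 0;            /* records written since the last sync */
    int syncs = 0;              /* sync calls issued so far */

    assert(n_records >= 0 && flush_every > 0);

    for (int i = 0; i < n_records; i++)
    {
        if (write(fd, record, (size_t) record_len) != record_len)
            return -1;

        if (++pending == flush_every)
        {
            if (fsync(fd) != 0)
                return -1;
            syncs++;
            pending = 0;
        }
    }

    /* Sync the tail so nothing written above is left unsynced. */
    if (pending > 0)
    {
        if (fsync(fd) != 0)
            return -1;
        syncs++;
    }

    return syncs;
}
```

With flush_every = 1 every record costs one sync call; a larger value
trades a longer unsynced window for fewer cache flushes. Timing the two
variants on the same device gives a rough feel for the cost involved.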

That's what caused the large regression when using checkpoint
sorting/flushing with synchronous_commit=off. With that fixed - often a
performance improvement on its own - I don't see that regression anymore.

> After more considerations, my final understanding is that this behavior only
> occurs with "asynchronous commit", aka a situation when COMMIT does not wait
> for data to be really fsynced, but the fsync is to occur within some delay
> so it will not be too far away, some kind of compromise for performance
> where commits can be lost.


> Now all this is somehow alien to me because the whole point of committing is
> having the data to disk, and I would not consider a database to be safe if
> commit does not imply fsync, but I understand that people may have to
> compromise for performance.

It's obviously not applicable to every scenario, but in a *lot* of
real-world scenarios a sub-second loss window doesn't have any actual
negative implications.

