On Wed, Feb 3, 2016 at 11:12 AM, Amit Kapila <amit.kapil...@gmail.com>
> On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby <jim.na...@bluetreble.com>
>> On 1/31/16 3:26 PM, Jan Wieck wrote:
>>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>>> operation.  The reason the OS couldn't find the corresponding block in
>>>> memory is that, while closing the WAL file, we use
>>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which
>>>> leads to this problem.  So the conclusion of this experiment is that
>>>> although we can avoid re-writing WAL data by doing exact writes,
>>>> doing so could lead to a significant reduction in TPS.
>>> POSIX_FADV_DONTNEED isn't the only way how those blocks would vanish
>>> from OS buffers. If I am not mistaken we recycle WAL segments in a round
>>> robin fashion. In a properly configured system, where the reason for a
>>> checkpoint is usually "time" rather than "xlog", a WAL file being
>>> recycled was last written to and closed about a complete
>>> checkpoint_timeout ago or longer. You must have a really big amount of spare
>>> RAM in the machine to still find those blocks in memory. Basically we
>>> are talking about the active portion of your database, shared buffers,
>>> the sum of all process local memory and the complete pg_xlog directory
>>> content fitting into RAM.
> I think that could only be a problem if the reads were happening at the
> write or fsync call, but that is not the case here.  Further investigation
> on this point reveals that the reads are not for the fsync operation;
> rather, they happen when we call posix_fadvise(,,POSIX_FADV_DONTNEED).
> Although this behaviour (writing in non-OS-page-cache-size chunks can
> lead to reads if followed by a call to posix_fadvise
> (,,POSIX_FADV_DONTNEED)) is not very clearly documented, the
> reason for it is that the fadvise() call maps the specified data range
> (which in our case is the whole file) into a list of pages and then
> invalidates them, which removes them from the OS cache.  Any
> misaligned (w.r.t. OS page size) writes done while writing/fsyncing the
> file can then cause additional reads, since not everything we write
> will end on an OS-page boundary.

On further testing, it has been observed that misaligned writes can
cause reads even when the blocks related to the file are not in memory, so
I think what Jan is describing is right.  The only case where there is
absolutely zero chance of reads is when we write on the OS-page boundary,
which is generally 4K.  However, I still think it is okay to provide an
option to write WAL in smaller chunks (512 bytes, 1024 bytes, etc.)
for the cases where these are beneficial, such as when wal_level is
greater than or equal to 'archive', and to keep the default as the OS
page size if that is smaller than 8K.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
