Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

Amit Kapila Mon, 13 Feb 2012 00:40:16 -0800

Dan, I believe your double-write buffer approach is right, as it has the 
potential to avoid the latency that backends incur from full-page writes 
after a checkpoint. Overall I/O may well be higher in this case, but if we can 
make sure that in most scenarios a backend never has to do the I/O itself, it 
can also show a performance improvement compared to full page writes.

-----Original Message-----
From: Dan Scales [mailto:[email protected]] 
Sent: Thursday, February 09, 2012 5:30 AM
To: Amit Kapila
Cc: PG Hackers
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

> Is there any problem if the double write is done only by the bgwriter or 
> checkpointer? 
> Something like: whenever a backend process has to evict a buffer, it does 
> as you have described and writes the page into the double-write buffer, but 
> the bgwriter checks this buffer and flushes from it.
> Also, whenever a backend sees that the double-write buffer is more than 2/3 
> full (or some other threshold), it tells the bgwriter to flush from the 
> double-write buffer.
> This can ensure very little I/O by any backend.

Yes, I think this is a good idea.  I could make changes so that the backends 
hand off the responsibility to flush batches of the double-write buffer to the 
bgwriter whenever possible.  This would avoid some long IO waits in the 
backends, though the backends may of course eventually wait anyways for the 
bgwriter if IO is not fast enough.  I did write the code so that any process 
can write a completed batch if the batch is not currently being flushed (so as 
to deal with crashes by backends).  Having the backends flush the batches as 
they fill them up was just simpler for a first prototype.
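
The hand-off discussed in this exchange can be illustrated with a small, single-threaded toy model. This is only a sketch in Python, not code from the patch: the class name, the `wake_bgwriter` callback, and the choice to return False when the buffer is full are all assumptions made for illustration. The 128-page capacity and the 2/3 threshold come from the thread.

```python
class HandoffBuffer:
    """Toy model of backends handing double-write flushes to the bgwriter."""

    def __init__(self, capacity=128, wake_bgwriter=lambda: None):
        self.capacity = capacity            # total pages the buffer can hold
        self.pending = []                   # pages copied in, not yet flushed
        self.wake_bgwriter = wake_bgwriter  # hypothetical "please flush" signal
        self.signalled = False              # avoid signalling more than once

    def backend_add(self, page_id):
        """Called by a backend evicting a dirty page; returns False if full."""
        if len(self.pending) >= self.capacity:
            return False                    # the backend would have to wait here
        self.pending.append(page_id)
        # More than 2/3 full: ask the bgwriter to flush, but only once.
        if not self.signalled and len(self.pending) * 3 > self.capacity * 2:
            self.signalled = True
            self.wake_bgwriter()
        return True

    def bgwriter_flush(self):
        """Called by the bgwriter; takes every pending page for flushing."""
        flushed, self.pending = self.pending, []
        self.signalled = False
        return flushed
```

With a 128-page buffer the signal fires on the 86th pending page. The False return models the case Dan mentions, where a backend still ends up waiting because the bgwriter cannot keep up with the I/O.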

Dan

----- Original Message -----
From: "Amit Kapila" <[email protected]>
To: "Dan Scales" <[email protected]>, "PG Hackers" 
<[email protected]>
Sent: Tuesday, February 7, 2012 1:08:49 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

>> I think it is a good idea, and can help double-writes perform better in the 
>> case of lots of backend evictions.
I don't understand this point, because from the data in your mail it appears 
that when shared_buffers is smaller, which is when more evictions can happen, 
the performance is lower.

ISTM that the performance is lower when shared_buffers is small because the 
I/O may be done by the backend process itself, which can degrade performance. 
Is there any problem if the double write is done only by the bgwriter or 
checkpointer? 
Something like: whenever a backend process has to evict a buffer, it does as 
you have described and writes the page into the double-write buffer, but the 
bgwriter checks this buffer and flushes from it.
Also, whenever a backend sees that the double-write buffer is more than 2/3 
full (or some other threshold), it tells the bgwriter to flush from the 
double-write buffer.
This can ensure very little I/O by any backend.

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Dan Scales
Sent: Saturday, January 28, 2012 4:02 AM
To: PG Hackers
Subject: [HACKERS] double writes using "double-write buffer" approach [WIP]

I've been prototyping the double-write buffer idea that Heikki and Simon had 
proposed (as an alternative to a previous patch that only batched up writes by 
the checkpointer).  I think it is a good idea, and can help double-writes 
perform better in the case of lots of backend evictions.
It also centralizes most of the code change in smgr.c.  However, it is trickier 
to reason about.

The idea is that all page writes generally are copied to a double-write buffer, 
rather than being immediately written.  Note that a full copy of the page is 
required, but it can be folded in with a checksum calculation.
Periodically (e.g. every time a certain-size batch of writes has been added), 
some writes are pushed out using double writes -- the pages are first written 
and fsynced to a double-write file, then written to the data files, which are 
then fsynced.  The double writes then allow for fixing torn pages, so 
full_page_writes can be turned off (thus greatly reducing the size of the WAL 
log).
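
The write ordering described above can be sketched as follows. This is a minimal illustration in Python using POSIX calls, not the patch's code (which is C in smgr.c). The function name, the `(data_path, block_no, page)` tuple layout, and the 8192-byte page size are assumptions. The sketch also leaves out what a real double-write file would need for recovery, such as a header recording which page belongs where, and it ignores short writes.

```python
import os

PAGE_SIZE = 8192  # assumed page size for this sketch


def flush_batch(dw_path, batch):
    """Flush a batch of (data_path, block_no, page_bytes) using double writes."""
    # Step 1: write every page to the double-write file and fsync it, so a
    # complete copy of each page is durable before any data file is touched.
    fd = os.open(dw_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _, _, page in batch:
            os.write(fd, page)
        os.fsync(fd)
    finally:
        os.close(fd)

    # Step 2: write the pages to their home locations, then fsync each data
    # file that was touched.
    touched = {}
    try:
        for data_path, block_no, page in batch:
            fd = touched.get(data_path)
            if fd is None:
                fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o600)
                touched[data_path] = fd
            os.pwrite(fd, page, block_no * PAGE_SIZE)
        for fd in touched.values():
            os.fsync(fd)
    finally:
        for fd in touched.values():
            os.close(fd)
```

If a crash tears a page during step 2, an intact copy of that page is still available in the double-write file, which is what makes it possible to run with full_page_writes off.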

The key changes are conceptually simple:

1.  In smgrwrite(), copy the page to the double-write buffer.  If a big
    enough batch has accumulated, then flush the batch using double
    writes.  [I don't think I need to intercept calls to smgrextend(),
    but I am not totally sure.]

2.  In smgrread(), always look first in the double-write buffer for a
    particular page, before going to disk.

3.  At the end of a checkpoint and on shutdown, always make sure that the
    current contents of the double-write buffer are flushed.

4.  Pass flags around in some cases to indicate whether a page buffer
    needs a double write or not.  (I think eventually this would be an
    attribute of the buffer, set when the page is WAL-logged, rather than
    a flag passed around.)

5.  Deal with duplicates in the double-write buffer appropriately (very
    rarely happens).
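
The changes listed above can be modelled in a few lines. This is a toy, in-memory sketch in Python and not the patch itself: the class, the `flush` and `read_from_disk` callbacks, and the key layout are hypothetical. In particular, the list does not say how duplicates are handled, so replacing the older copy in place is an assumption made here.

```python
class DoubleWriteBuffer:
    """Toy model of a double-write buffer sitting in front of the data files."""

    def __init__(self, batch_size=64, flush=lambda batch: None):
        self.batch_size = batch_size
        self.pages = {}     # (rel, block_no) -> copied page contents
        self.flush = flush  # hypothetical callback that does the double write

    def smgrwrite(self, rel, block_no, page):
        # Change 1: copy the page into the buffer instead of writing it.
        # Change 5 (assumed policy): a duplicate replaces the older copy.
        self.pages[(rel, block_no)] = bytes(page)
        if len(self.pages) >= self.batch_size:
            self.flush_all()

    def smgrread(self, rel, block_no, read_from_disk):
        # Change 2: look in the buffer first, since it may hold a newer
        # version of the page than the data file does.
        page = self.pages.get((rel, block_no))
        if page is not None:
            return page
        return read_from_disk(rel, block_no)

    def flush_all(self):
        # Change 3: also called at the end of a checkpoint and on shutdown.
        if self.pages:
            batch = sorted(self.pages.items())
            self.pages = {}
            self.flush(batch)
```

Change 4 (the flag saying whether a page needs a double write at all) is left out of the sketch.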

To get good performance, I needed to have two double-write buffers, one for the 
checkpointer and one for all other processes.  The double-write buffers are 
circular buffers.  The checkpointer double-write buffer is just a single batch 
of 64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches 
of 64 pages each.  Each batch goes to a different double-write file, so that 
they can be issued independently as soon as each batch is completed.  Also, I 
need to sort the buffers being checkpointed by file/offset (see ioseq.c), so 
that the checkpointer batches will most likely only have to write and fsync one 
data file.
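
To show why sorting by file/offset helps the checkpointer batches, here is a small sketch in Python. It is not the code in ioseq.c, and the function names and the `(file, block_no)` tuples are made up for the example. Only the batch size of 64 comes from the description above.

```python
def checkpoint_batches(dirty_pages, batch_size=64):
    """Sort (file, block_no) pairs and cut them into fixed-size batches."""
    ordered = sorted(dirty_pages)  # by file first, then by block within file
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]


def files_per_batch(batches):
    """Return the set of data files each batch would have to fsync."""
    return [{file for file, _ in batch} for batch in batches]
```

With 100 dirty pages in one file and 28 in another, sorting gives two batches: the first touches only the first file, and only the second batch spans both. Without the sort, every batch could touch both files and need an fsync on each.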

Interestingly, I find that the plot of tpm for DBT2 is much smoother (though 
still has wiggles) with double writes enabled, since there are no unpredictable 
long fsyncs at the end of (or during) a checkpoint.

Here are performance numbers for the double-write buffer (same configs as the 
previous numbers), for 2-processor, 60-minute, 50-warehouse DBT2.  The 
right-hand column shows the size of shared_buffers and the size of the RAM in 
the virtual machine.  FPW stands for full_page_writes, DW for double_writes.  
'two disk' means the WAL log is on a separate ext3 filesystem from the data 
files.

           FPW off FPW on  DW on, FPW off
one disk:  15488   13146   11713                    [5G buffers, 8G VM]
two disk:  18833   16703   18013

one disk:  12908   11159    9758                    [3G buffers, 6G VM]
two disk:  14258   12694   11229

one disk:  10829    9865    5806                    [1G buffers, 8G VM]
two disk:  13605   12694    5682

one disk:   6752    6129    4878                    [1G buffers, 2G VM]
two disk:   7253    6677    5239

The performance of DW on the small cache cases (1G shared_buffers) is now much 
better, though still not as good as FPW on.  In the medium cache case (3G 
buffers), where there are significant backend dirty evictions, the performance 
of DW is close to that of FPW on.  In the large cache (5G buffers), where the 
checkpointer can do all the work and there are minimal dirty evictions, DW is 
much better than FPW in the two disk case.
In the one disk case, it is somewhat worse than FPW.  However, interestingly, 
if you just move the double-write files to a separate ext3 filesystem on the 
same disk as the data files, the performance goes to 13107, which is now on 
par with FPW on.  We are obviously getting hit by the ext3 fsync slowness 
issues.  (I believe that an fsync on a filesystem can stall on other unrelated 
writes to the same filesystem.)
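
For readers who want the comparison as ratios, the short Python sketch below divides the DW column by the FPW on column for each row of the table above. The numbers are copied from the table. The dictionary layout and the function name are only for illustration.

```python
# tpm figures from the table above: (FPW off, FPW on, DW on with FPW off)
results = {
    ("5G buffers, 8G VM", "one disk"): (15488, 13146, 11713),
    ("5G buffers, 8G VM", "two disk"): (18833, 16703, 18013),
    ("3G buffers, 6G VM", "one disk"): (12908, 11159, 9758),
    ("3G buffers, 6G VM", "two disk"): (14258, 12694, 11229),
    ("1G buffers, 8G VM", "one disk"): (10829, 9865, 5806),
    ("1G buffers, 8G VM", "two disk"): (13605, 12694, 5682),
    ("1G buffers, 2G VM", "one disk"): (6752, 6129, 4878),
    ("1G buffers, 2G VM", "two disk"): (7253, 6677, 5239),
}


def dw_vs_fpw(table):
    """Return DW throughput as a fraction of FPW-on throughput per row."""
    return {key: dw / fpw_on for key, (_, fpw_on, dw) in table.items()}


for (config, disks), ratio in dw_vs_fpw(results).items():
    print(f"{config}, {disks}: DW is {ratio:.0%} of FPW on")
```

The only row where the ratio exceeds 1 is the 5G two disk case, which matches the summary above.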

Let me know if you have any thoughts/comments, etc.  The patch is enclosed, and 
the README.doublewrites is updated a fair bit.

Thanks,

Dan

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
