On Mon, Jun 2, 2014 at 6:04 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
> On Wed, May 28, 2014 at 1:10 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> > IIUC, in the DBW mechanism we need a temporary sequential log file of
> > fixed size which is used to write data before the data gets written to
> > its actual location in the tablespace. Since the temporary log file is
> > of fixed size, the number of pages that need to be read during recovery
> > should be less than with FPW, because with FPW it needs to read all the
> > pages written to the WAL log after the last successful checkpoint.
>
> Hmm... maybe I'm misunderstanding how WAL replay works in the DBW case.
> Imagine the case where we try to replay two WAL records for page A and
> the page has not been cached in shared_buffers yet. If FPW is enabled,
> the first WAL record is an FPW and it is simply read into shared_buffers;
> the page doesn't need to be read from disk. Then the second WAL record
> is applied.
>
> OTOH, in the DBW case, how does this example work? I was thinking that
> first we try to apply the first WAL record but find that page A doesn't
> exist in shared_buffers yet. We try to read the page from disk, check
> whether its CRC is valid, and read the same page from the double buffer
> if it's invalid. After reading the page into shared_buffers, the first
> WAL record can be applied. Then the second WAL record is applied. Is my
> understanding right?
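The two-phase recovery being discussed here (first make torn data pages consistent using the doublewrite area, then start WAL replay with no FPIs) can be sketched roughly as below. This is a toy model, not PostgreSQL or InnoDB code: the `Page` layout, the simple multiplicative checksum, and the `dbw_recover` function are all hypothetical names chosen for illustration, and a real implementation would use full-size pages with a CRC over the page contents.

```c
#include <stdint.h>

#define PAGE_SIZE 64          /* toy page size; real systems use 8K/16K */
#define DBW_SLOTS 4           /* doublewrite area holds a fixed number of pages */

/* Hypothetical page layout: payload plus a checksum footer. */
typedef struct {
    uint32_t page_no;
    uint8_t  data[PAGE_SIZE];
    uint32_t checksum;
} Page;

static uint32_t page_checksum(const Page *p)
{
    uint32_t sum = p->page_no;
    for (int i = 0; i < PAGE_SIZE; i++)
        sum = sum * 31u + p->data[i];
    return sum;
}

static int page_is_consistent(const Page *p)
{
    return page_checksum(p) == p->checksum;
}

/*
 * Phase 1 of crash recovery with a doublewrite buffer: scan the
 * fixed-size doublewrite area and, for each slot,
 *   - if the doublewrite copy itself is torn, discard it (the crash
 *     happened while writing the doublewrite area, so the page at its
 *     original location was never touched and must still be intact);
 *   - if the tablespace copy is torn, restore it from the doublewrite
 *     copy.
 * Only after this does recovery start reading WAL.
 */
static void dbw_recover(Page dbw[DBW_SLOTS], Page tablespace[], int npages)
{
    for (int i = 0; i < DBW_SLOTS; i++) {
        if (!page_is_consistent(&dbw[i]))
            continue;                     /* torn doublewrite copy: discard */
        uint32_t no = dbw[i].page_no;
        if (no < (uint32_t) npages && !page_is_consistent(&tablespace[no]))
            tablespace[no] = dbw[i];      /* torn data page: restore */
    }
}
```

The key invariant this sketch relies on is write ordering: a page is fsynced to the doublewrite area before it is written to its home location, so at any crash point at least one consistent copy of every in-flight page exists, and the scan cost is bounded by the fixed size of the doublewrite area.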
I think the way DBW works is that before reading WAL, it first makes the data pages consistent. It checks the doublewrite buffer contents against the pages in their original locations: if a page is inconsistent in the doublewrite buffer, it is simply discarded; if it is inconsistent in the tablespace, it is recovered from the doublewrite buffer. After reaching the end of the doublewrite buffer, recovery starts reading WAL.

So in the example case above, it reads the first record from WAL and checks whether the page is already in shared_buffers; if so, it applies the WAL change, otherwise it first reads the page into shared_buffers and then applies the WAL change. For the second record, it doesn't need to read the page again.

The saving during recovery comes from the fact that with DBW it does not read an FPI from WAL, just the two ordinary records (it still has to read a WAL page, but that page contains many records), so it seems to be a net win. Also, with DBW the extra work done (reading the doublewrite buffer and checking its consistency against the actual pages) is always fixed, since the size of the doublewrite buffer is fixed, so its impact should be much less than reading FPIs from WAL written after the last successful checkpoint. If my understanding above is right, then recovery performance should be better with DBW in most cases.

I think the cases DBW might need to take care of are those with a lot of backend evictions. In such scenarios a backend might itself need to write both to the doublewrite buffer and to the actual page. This can have more impact during bulk reads (when hint bits have to be set) and during Vacuum, which is performed in a ring buffer. One improvement that could be made here is to change the buffer eviction algorithm so that it can give up a buffer which would need to be written to the doublewrite buffer. There can be other improvements as well, depending on the DBW implementation.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com