(2013/07/19 0:41), Greg Smith wrote:
On 7/18/13 11:04 AM, Robert Haas wrote:
>> On a system where fsync is sometimes very very slow, that
>> might result in the checkpoint overrunning its time budget - but SO
>> WHAT?
>
> Checkpoints provide a boundary on recovery time.  That is their only
> purpose.  You can always do better by postponing them, but you've now
> changed the agreement with the user about how long recovery might take.
Recently, users who consider system availability important have been running synchronous replication clusters. And, as Robert says, users who cannot build a cluster system simply will not enable this feature through the GUC.

When the system becomes I/O-busy during fsync(), my patch does not pile additional I/O load onto fsync(). In fact, this is the same approach the OS writeback mechanism takes. I read the kernel source in fs/fs-writeback.c from linux-2.6.32-358.0.1.el6, which is the latest RHEL 6.4 kernel: wb_writeback() throttles disk I/O in the OS writeback path. Please see the code below. When the OS decides I/O is busy, it bails out rather than issuing more writes.

fs/fs-writeback.c @wb_writeback()
 623                 /*
 624                  * For background writeout, stop when we are below the
 625                  * background dirty threshold
 626                  */
 627                 if (work->for_background && !over_bground_thresh())
 628                         break;
 629
 630                 wbc.more_io = 0;
 631                 wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 632                 wbc.pages_skipped = 0;
 633
 634                 trace_wbc_writeback_start(&wbc, wb->bdi);
 635                 if (work->sb)
 636                         __writeback_inodes_sb(work->sb, wb, &wbc);
 637                 else
 638                         writeback_inodes_wb(wb, &wbc);
 639                 trace_wbc_writeback_written(&wbc, wb->bdi);
 640                 work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 641                 wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 642
 643                 /*
 644                  * If we consumed everything, see if we have more
 645                  */
 646                 if (wbc.nr_to_write <= 0)
 647                         continue;
 648                 /*
 649                  * Didn't write everything and we don't have more IO, bail
 650                  */
 651                 if (!wbc.more_io)
 652                         break;
 653                 /*
 654                  * Did we write something? Try for more
 655                  */
 656                 if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
 657                         continue;
 658                 /*
 659                  * Nothing written. Wait for some inode to
 660                  * become available for writeback. Otherwise
 661                  * we'll just busyloop.
 662                  */
 663                 spin_lock(&inode_lock);
 664                 if (!list_empty(&wb->b_more_io))  {
 665                         inode = list_entry(wb->b_more_io.prev,
 666                                                 struct inode, i_list);
 667                         trace_wbc_writeback_wait(&wbc, wb->bdi);
 668                         inode_wait_for_writeback(inode);
 669                 }
 670                 spin_unlock(&inode_lock);
 671         }
 672
 673         return wrote;

Please look especially at lines 631, 651, and 656. MAX_WRITEBACK_PAGES is 1024 pages (1024 * 4096 bytes = 4 MB). The OS writeback scheduler never writes more than MAX_WRITEBACK_PAGES in one pass, because pushing more than that at once would make the system I/O-busy; and when it cannot write anything at all, it waits for I/O to recover instead of busy-looping. This is the same logic as my patch.
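
To make the analogy concrete, here is a minimal sketch of the same bail-out pattern applied to a checkpointer fsync loop. This is not the actual patch; MAX_CHECKPOINT_BATCH_PAGES, FSYNC_SLOW_THRESHOLD_USEC, and BACKOFF_SLEEP_USEC are hypothetical tunables invented for illustration:

#include <stdbool.h>
#include <time.h>
#include <unistd.h>

/* Illustrative values only; these are not real PostgreSQL GUCs. */
#define MAX_CHECKPOINT_BATCH_PAGES 1024     /* analogous to MAX_WRITEBACK_PAGES */
#define FSYNC_SLOW_THRESHOLD_USEC  100000   /* treat fsync > 100 ms as "IO busy" */
#define BACKOFF_SLEEP_USEC         200000   /* illustrative 200 ms recovery sleep */

static long
elapsed_usec(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1000000L
         + (end.tv_nsec - start.tv_nsec) / 1000L;
}

/* Write one bounded batch of dirty pages, fsync it, and report whether
 * the fsync looked slow (i.e. the device appears saturated). */
static bool
write_batch_was_slow(int fd)
{
    struct timespec start, end;

    /* ... write up to MAX_CHECKPOINT_BATCH_PAGES dirty pages to fd ... */

    clock_gettime(CLOCK_MONOTONIC, &start);
    fsync(fd);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return elapsed_usec(start, end) > FSYNC_SLOW_THRESHOLD_USEC;
}

void
checkpoint_write_loop(int fd, int nbatches)
{
    int i;

    for (i = 0; i < nbatches; i++)
    {
        /* Like wb_writeback(): when I/O looks busy, back off instead
         * of pushing more writes at the saturated device. */
        if (write_batch_was_slow(fd))
            usleep(BACKOFF_SLEEP_USEC);
    }
}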

In addition, you have said that performance improves if you set a large checkpoint_timeout or checkpoint_completion_target, but is that true in all cases? Since dirty buffers that have aged more than 30 seconds are written out at 5-second intervals, the many pauses between writes can degrade the workload into inefficient random writes. In the worst case, if a 200 ms sleep is inserted before each write, only 25 pages (200 KB) can be written per 5-second interval, which I think is very inefficient. So enlarging checkpoint_completion_target may actually reduce performance in some cases; I believe this is because the final fsync becomes heavy when the preceding writes trickle out too slowly.
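
To spell out that arithmetic (a back-of-the-envelope check only, assuming PostgreSQL's default 8 KB block size):

#include <stdio.h>

int
main(void)
{
    double interval_sec = 5.0;   /* write interval mentioned above */
    double sleep_sec    = 0.2;   /* 200 ms sleep inserted per write */
    double page_kb      = 8.0;   /* default PostgreSQL block size */

    double pages = interval_sec / sleep_sec;    /* 25 pages */

    printf("%.0f pages = %.0f KB written per %.0f-second interval\n",
           pages, pages * page_kb, interval_sec);
    return 0;
}

This prints "25 pages = 200 KB written per 5-second interval", the figure cited above.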

I would like to put together, with your input, an itemized list of tests that would serve as evidence for my patch, because each DBT-2 benchmark configuration takes about 3 to 4 hours to run. Of course, I think it is important to obtain your agreement on that list first.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

