I've just uploaded

http://developer.postgresql.org/~wieck/all_performance.v4.74.diff.gz

This patch contains the "still not yet ready" performance improvements discussed over the last couple of days.

_Shared buffer replacement_:

The buffer replacement strategy is a slightly modified version of ARC. The modifications are specializations of how CDBs (cache directory blocks) get promoted. Since PostgreSQL always looks up a buffer multiple times when updating (first during the scan, then during heap_update(), etc.), every updated block would jump right into the T2 (frequently accessed) queue. To prevent that, the Xid of the transaction that added a buffer to the T1 queue is remembered, and if the block is found in T1 again, the same transaction will not promote it into T2. This also affects blocks accessed like SELECT ... FOR UPDATE; UPDATE, as this is a usual strategy and does not mean that this particular datum is accessed frequently.

Blocks faulted in by vacuum are handled specially: they end up at the LRU end of the T1 queue, and when evicted from there their CDB gets destroyed instead of being added to the B1 queue. This prevents vacuum from polluting the cache's autotuning.

A GUC variable

    buffer_strategy_status_interval = 0 # 0-600 seconds

controls DEBUG1 messages, emitted every n seconds, that show the current queue sizes and the cache hit rates during the last interval.


_Vacuum page delay_:


This is Tom Lane's napping during vacuums, with another tuning option. I replaced the usleep() call with a PG_DELAY(msec) macro in miscadmin.h, which uses select(2) instead. That should address the possible portability problems.
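For readers who have not seen the select(2) trick: calling select() with no file descriptors and only a timeout simply sleeps for that long. The sketch below shows the idea as plain functions; it is not the PG_DELAY definition from the patch, and the names msec_to_timeval and delay_msec are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Split a millisecond count into the seconds/microseconds select() wants. */
static struct timeval
msec_to_timeval(long msec)
{
    struct timeval tv;

    assert(msec >= 0);
    tv.tv_sec = msec / 1000;
    tv.tv_usec = (msec % 1000) * 1000;
    return tv;
}

/*
 * Sleep for msec milliseconds using select() with empty fd sets.
 * Returns 0 when the timeout expired, -1 if the call failed
 * (for example when interrupted by a signal).
 */
static int
delay_msec(long msec)
{
    struct timeval tv = msec_to_timeval(msec);

    return select(0, NULL, NULL, NULL, &tv);
}
```

A caller that must sleep the full time would have to handle the -1 case itself; for a vacuum nap, being woken early by a signal is probably harmless.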

The config options

    vacuum_page_group_delay = 0  # 0-100 milliseconds
    vacuum_page_group_size  = 10 # 1-1000 pages

control how many pages get vacuumed as a group and how long vacuum will nap between groups.

I think this can be improved further if vacuum gets feedback from the buffer manager on whether a page was found clean in the cache, found already dirty, or had to be faulted in. Together with whether vacuum actually dirties the page, this would yield a sort of "vacuum page cost" that is accumulated and controls how often to nap. Vacuuming a page that is found in the cache and has no dead tuples would be cheap, while vacuuming a page that caused another dirty block to get evicted, was then read in, and finally ends up dirty because of dead tuples would be expensive.


_Lazy checkpoint_:


This is the checkpoint process with the ability to spread the buffer flushing over some time. The buffers are also written in the order given by the buffer replacement strategy. Currently that is a merged list of dirty buffers in the order of the T1 and T2 queues of ARC. Since buffers are replaced in that order, backends find clean buffers for eviction more often.

The config options

    lazy_checkpoint_time = 0        # 0-3600 seconds
    lazy_checkpoint_group_size = 50 # 10-1000 pages
    lazy_checkpoint_maxdelay = 500  # 100-1000 milliseconds

control how long the buffer flushing "should" take and how many dirty pages to write as a group before syncing and napping. The maxdelay parameter limits the nap, so that a really small amount of changes does not get spread out over that long.
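One way to read the interaction of the three options is the arithmetic below: divide the target time by the number of groups and cap the result at maxdelay. This is an interpretation for illustration, with invented function names; the patch may compute the nap differently.

```c
#include <assert.h>

/* Number of write groups needed for the given number of dirty pages. */
static int
groups_needed(int dirty_pages, int group_size)
{
    assert(dirty_pages >= 0 && group_size > 0);
    return (dirty_pages + group_size - 1) / group_size;
}

/*
 * Nap between groups, in milliseconds, so that all groups together take
 * about checkpoint_time_sec, but never more than maxdelay_msec per nap.
 * A checkpoint time of 0 means "do not spread the writes out at all".
 */
static long
nap_per_group_msec(int dirty_pages, int group_size,
                   int checkpoint_time_sec, int maxdelay_msec)
{
    int  groups = groups_needed(dirty_pages, group_size);
    long nap;

    if (groups == 0 || checkpoint_time_sec == 0)
        return 0;
    nap = (long) checkpoint_time_sec * 1000 / groups;
    return nap > maxdelay_msec ? maxdelay_msec : nap;
}
```

For example, with a target of 60 seconds and groups of 50 pages, 100000 dirty pages give 2000 groups and a 30 ms nap each. With only 1000 dirty pages the uncapped nap would be 3000 ms per group, so the 500 ms cap applies and the checkpoint finishes well before the 60 seconds are up.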

The syncing is currently done in a new function in md.c, mdfsyncrecent(), called through the smgr. The intention is to maintain some LRU of written-to file descriptors and pg_fdatasync() them. I haven't found the right place for that yet, so for now it simply does a system-wide sync().
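The intended replacement for the global sync() could look roughly like the sketch below: a small most-recently-written list of file descriptors, of which up to max get synced per call. All names here are hypothetical, the list size is arbitrary, and POSIX fsync() stands in for pg_fdatasync() to keep the example self-contained.

```c
#include <assert.h>
#include <unistd.h>

#define RECENT_MAX 8        /* arbitrary size for this sketch */

typedef struct
{
    int fds[RECENT_MAX];    /* fds[0] is the most recently written */
    int count;
} RecentFds;

/* Remember that fd was just written to; moves it to the front. */
static void
recent_note_write(RecentFds *r, int fd)
{
    int i;
    int pos;

    assert(r != NULL);
    pos = r->count;                     /* "not found" by default */
    for (i = 0; i < r->count; i++)
    {
        if (r->fds[i] == fd)
        {
            pos = i;
            break;
        }
    }
    if (pos == RECENT_MAX)
        pos = RECENT_MAX - 1;           /* list full: drop the oldest */
    else if (pos == r->count)
        r->count++;                     /* new entry, list grows */
    for (i = pos; i > 0; i--)
        r->fds[i] = r->fds[i - 1];
    r->fds[0] = fd;
}

/*
 * Sync up to max of the most recently written descriptors and forget
 * them.  Returns the number of successful fsync() calls.
 */
static int
recent_sync(RecentFds *r, int max)
{
    int i;
    int n;
    int synced = 0;

    assert(r != NULL && max >= 0);
    n = r->count < max ? r->count : max;
    for (i = 0; i < n; i++)
    {
        if (fsync(r->fds[i]) == 0)
            synced++;
    }
    for (i = n; i < r->count; i++)
        r->fds[i - n] = r->fds[i];
    r->count -= n;
    return synced;
}
```

A real version would also have to cope with descriptors that were closed in the meantime, which this sketch ignores apart from not counting the failed fsync().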

My idea here is that it really does not matter how accurately the individual files are forced to disk during this. All we care about is getting the kernel to perform some physical writes while we're writing the buffers out, rather than letting it buffer those writes in the OS until we finish the checkpoint.

The lazy checkpoint configuration should only affect automatic checkpoints started by the postmaster because a checkpoint_timeout occurred. Actually, it currently seems to apply to manually started checkpoints as well. BufferSync() monitors the time to finish, held in shared memory, so it would be relatively easy to hurry up a running lazy checkpoint by setting that to zero. It's just that the postmaster can't do that, because it does not have a PGPROC structure and therefore can't lock that shmem structure. This is a must-fix item, because being able to hurry up the checkpointer is critical at shutdown time.


_TODO_:


* Replace the global sync() in mdfsyncrecent(int max) with calls to
  pg_fdatasync()

* Add functionality to postmaster to hurry up a running checkpoint
  at shutdown.

* Make sure that manual checkpoints are not affected by the lazy
  checkpoint config options and that they too hurry up a running one.

* Further improve vacuum's napping strategy depending on the I/O
  actually caused per page.


_NOTE_:


The core team is well aware of the high demand for these features. As things stand however, it is impossible to get this functionality released in version 7.4.

That does not mean that we have no chance of including some or all of the functionality in a subsequent 7.4.x release. But for that to happen, the TODO items above must get done first. Further, we need a good amount of evidence that these changes actually achieve the desired effect to a degree that justifies breaking our "no features in dot releases" rule. We also need a good amount of evidence that the features don't break anything or sacrifice stability, and that backward-compatible behaviour (where possible ... not possible with ARC vs. LRU) is the default.

I personally would like to see this work included in a 7.4.x release. But it requires people to actually run tests, stress some hardware, check platform portability and *give us feedback*, because this is what we get for the release candidates, and these improvements can under no circumstances be of any lower quality than that. If this goes into a 7.4.x release and there is any platform-dependent issue in it, it endangers the timely fix of other bugs for those platforms, and that's a no-go.


Happy testing



Jan


--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== [EMAIL PROTECTED] #

