Re: [HACKERS] pg_autovacuum next steps

Matthew T. O'Connor Sun, 21 Mar 2004 19:38:26 -0800

On Sun, 2004-03-21 at 18:12, Tom Lane wrote:
> "Matthew T. O'Connor" <[EMAIL PROTECTED]> writes:
> > [ rtf document ]
> 
> Please repost in some less proprietary format.  Plain text is generally
> considered the thing to use on this list.


I don't think RTF is proprietary but I should have just posted inline
anyway so here is a copy:

pg_autovacuum Version 2
Design Document:


Exec Summary:
pg_autovacuum was initially released as a contrib module in PostgreSQL
v7.4.   The version in 7.4 is by design very simple.  No configuration
is required, and very little configuration is possible.  Despite these
limitations it was voted the most popular new feature of PostgreSQL v7.4
according to the survey held on postgresql.org
(http://www.postgresql.org/survey.php?View=1&SurveyID=23).  

Despite it's popularity there is much room for improvement.  This
document sets out to define the most important improvements that would
help pg_autovacuum to become a truly powerful asset to the suite of
tools that come with PostgreSQL.


Current Problems & Limitations:

Based on user feedback from people using pg_autovacuum in the field, and
my own observations, there are a number of problems and limitation with
pg_autovacuum.  They are:

* Inability to customize thresholds on a per table basis
* Inability to set default thresholds on a per database basis
* Inability to exclude specific databases / tables from pg_autovacuum
monitoring
* Inability to schedule vacuums during off-peak times
* Lack of integration related to startup and shutdown
* Ignorance of VACUUM and ANALYZE operations performed outside
pg_autovacuum (requires backend integration? or can listen / notify can
be used?)
* Lack of logging options / syslog integration / log rotation options
* Create table fails because template1 is busy

I'm not sure how to address all of these concerns, or that they all
should be addressed right now.  One of my big questions is backend
integration.  I am leaning towards leaving pg_autovacuum as a client
application in contrib for one more release.  During this time, I can
continue to tweak and improve pg_autovacuum so that we will have a very
good idea what the final product should be before we make it a standard
backend process.


For PostgreSQL 7.5,  I plan to implement these new features:

 1.Per database defaults and per table thresholds (including total
exclusion)
 2.Persistent data
 3.Single-Pass Mode (external scheduling from cron etc...)
 4.Off peak scheduling

1. Per Database defaults and Per table Thresholds:

There are differing opinions as to the best way to providing these this
feature.  The primary debate is where to save the configuration data.  I
see three options:

 1.Store config data inside a special pg_autovacuum table inside
existing databases that wants custom settings.   

 2.Use a config file.  This would require some additional coding to add
the required parsing, but is possible. 

 3.Create a pg_autovacuum database inside any cluster that wants to
customize their settings.  

Since many people do not like tools that clutter their databases by
adding tables, I think option 1 (adding a pg_autovacuum table to
existing databases) is right out.  Using a config file would be Ok, but
would require additional parsing code.  My preference is option 3. 
Since pg_autovacuum will (hopefully) eventually become an integrated
part of the backend, it will eventually be able to add required data to
the system catalogs.  Given these two premises, as long as pg_autovacuum
remains a contrib module it could use it's own database to mimic having
system tables.  If this database exists, it will be used, if it does not
exist, then pg_autovacuum will work just as it did in the 7.4 release
with very limited options available to it.  The user will be able to
specify a non-default database.

Table Structure for database specific defaults and table specific
thresholds:

databases_defaults: (will reference the pg_class system table)
        id                                      serial primary key
        exclude_database                        boolean
        default_vacuum_scaling_factor   float
        default_vacuum_base_value       int
        default_analyze_scaling_factor  float
        default_analyze_base_value      int
        dboid                                   oid references pg_database.oid
        
table_thresholds
        id                      serial primary key
        exclude_table           boolean (exclude this table)
        vacuum_scaling_factor   float   (equivalent to  -v)
        vacuum_base_value       int     (equivalent to -V)
        vacuum_threshold        float   (if > 0, use this threshold)
        analyze_scaling_factor  float   (equivalent to -a)
        analyze_base_value      int     (equivalent to -A)
        analyze_threshold       float   (if > 0 use this threshold)
        relid                   oid references pg_classs.relid


2.Persistent pg_autovacuum Data:

Right now pg_autovacuum has no memory of what was going on the last time
it was run.  So if significant changes have happened while pg_autovacuum
is not running, they will not be counted in the analysis of when to
perform a vacuum or analyze operation which can result in under
vacuuming.  So, pg_autovacuum should occasionally write down it's
numbers to the database.  The data  will be stored in an additional
table called table_data

table_data
        id                      serial primary key
        CountAtLastAnalyze      long 
        CountAtLastVacuum       long
        table_thresholds_id     int references table_thresholds


3.Single-Pass Mode (External Scheduling):

I have received requests to be able to run pg_autovacuum only on request
(not as a daemon) making only one pass over all the tables (not looping
indefinately).  The advantage being that it will operate more like the
current vacuum command except that it will only vacuum tables that need
to be vacuumed.  This feature could be useful as long as pg_autovacuum
exists outside the backend.  If pg_autovacuum gets integrated into the
backend and gets automatically started as a daemon during startup, then
this option will no longer make sense.

Once we have persistent data (Step 2) then we can easily operate in
Single-Pass Mode.


4.Off-Peak Scheduling:

A fundamental advantage of our vacuum system is that the work required
to reclaim table space is taken out of the critical path and can be
moved to and off-peak time when cycles are less precious.  One of the
drawbacks of the current pg_autovacuum is that it doesn't have any way
to factor this in.

In it's simplest form (which I will implement first) I would add the
ability to add a second set of thresholds that will be active only
during an âoff-peakâ time that can be specified in the pg_autovacuum
database, perhaps in a general_settings table.





A few notes about things I'm not planning on working on at the moment.

Better logging options:

An additional logging could would be added to the pg_autovacuum database
and will log all activity (vacuums and analyzes) along with their
corresponding duration.

Syslog support.  I'm not sure this is really needed, but a simple patch
was submitted by one user and perhaps that can be reviewed / improved
and applied.





---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to [EMAIL PROTECTED] so that your
      message can get through to the mailing list cleanly

Re: [HACKERS] pg_autovacuum next steps

Reply via email to