> For my part, while that's certainly an interesting idea, it's far
> more
> complicated than even providing GUCs and the idea is to make PG just
> "do
> it right", not to offer the user more ways to get it wrong...

Yes, please let's not replace the existing too-simplistic knobs with giant 
complicated gadgets nobody, including us, understands.

For my part, over the last 3 years of consulting and dealing with 
postgresql.conf settings for more than 140 clients:

* only 10% of them ever touched the autoanalyze settings at all
* of the ~14 who did:
   * 1 improved the tuning of their database
   * 3 of them messed up autoanalyze, causing stats and vacuum issues
   * ~10 had no measurable effect

... so you'll understand when I say that I don't think ease of knob-twiddling 
is a priority for autoanalyze design.  In fact, I'd say that removing the knobs 
entirely is a design goal.

I've been going over the notes and email archives from the period when Matt 
O'Connor and I arrived at the current settings.  All of our testing was devoted 
to autovacuum, not autoanalyze.  The threshold+scale_factor design works pretty 
well for autovacuum; it prevents us from constantly vacuuming small tables, or 
large tables with less than 20% dead rows.  And I did extensive testing using 
DBT2 on OSDL to set the current defaults.
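
For anyone following along who hasn't read the autovacuum code, here's a minimal 
sketch of that threshold+scale_factor trigger logic; the default numbers shown 
are illustrative, not necessarily what ships:

    # Sketch of the existing threshold + scale_factor triggers.
    # Default values here are illustrative only.

    def needs_autovacuum(dead_tuples, reltuples, threshold=50, scale_factor=0.2):
        # Vacuum fires once dead tuples exceed a flat threshold plus a
        # fraction of the table size.
        return dead_tuples > threshold + scale_factor * reltuples

    def needs_autoanalyze(changed_tuples, reltuples, threshold=50, scale_factor=0.1):
        # Analyze fires once inserted + updated + deleted tuples since the
        # last analyze exceed the same kind of bound.
        return changed_tuples > threshold + scale_factor * reltuples

The scale_factor term is what keeps small tables from being vacuumed constantly, 
which is the behavior described above.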

Our mistake was assuming that the same formula which worked well for vacuum 
would work well for analyze.  And since the DBT2 database has entirely 
medium-sized tables full of random data, no shortcomings in this thinking 
showed up in the tests.  Since the only counterproposal at the time was to have 
a flat percentage without a threshold, we got the current defaults.

So, problem #1 is coming up with a mathematical formula.  My initial target 
values are in terms of # of rows in the table vs. # of writes before analyze is 
triggered:

          1 :      3
         10 :      5
        100 :     10
       1000 :    100
     100000 :   2000
    1000000 :   5000
   10000000 :  25000
  100000000 : 100000

... etc.  So problem #1 is finding a mathematical formula which gives this kind 
of curve.  I've tried some curve-fitting software, but I don't know how to use 
it well enough to get anything useful.
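
As a starting point for the formula hunt (not a proposal for the final shape), a 
simple power-law fit of the form writes ~ a * rows^b tracks those target values 
reasonably well.  Here's a rough sketch using numpy on the pairs above; 
everything about it is exploratory:

    # Rough curve-fit sketch: fit writes = a * rows**b by least squares in
    # log-log space, using the target pairs listed above.
    import numpy as np

    rows   = np.array([1, 10, 100, 1000, 100000, 1000000,
                       10000000, 100000000], dtype=float)
    writes = np.array([3, 5, 10, 100, 2000, 5000,
                       25000, 100000], dtype=float)

    b, log_a = np.polyfit(np.log10(rows), np.log10(writes), 1)
    a = 10 ** log_a
    print(f"writes ~= {a:.2f} * rows ** {b:.3f}")

    # Compare the fit against each target value.
    for r, w in zip(rows, writes):
        print(f"{int(r):>11d}  target {int(w):>7d}  fit {int(a * r ** b):>7d}")

Whether the final answer is a power law, a log curve, or something piecewise is 
exactly the open question, of course.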

The second problem is actually testing the result.  At this point, we don't have 
any performance tests which create anything other than fairly randomly 
distributed data, which doesn't tend to expose any issues in analyze.  We 
really need a performance test where new data is skewed and unbalanced, 
including tables of radically different sizes, and where we're set up to 
measure the level of inaccuracy in query statistics.

Hmmm.  Actually, for measuring the inaccuracy, I have some tools thanks to 
David Wheeler.  But not to generate the test in the first place.
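
In case it helps whoever picks this up, here's a minimal sketch of the kind of 
generator I mean: produce a COPY-ready file where one column is heavily skewed 
(Zipf-like) rather than uniform.  The table name, columns, and parameters are 
all made up for illustration:

    # Sketch of a skewed-data generator: one heavy-tailed column, one roughly
    # uniform column, written as tab-separated text for COPY.  Names and
    # parameters are illustrative only.
    import numpy as np

    rng = np.random.default_rng(42)
    n_rows = 1_000_000

    # A few category values dominate; most are rare.
    category = rng.zipf(a=1.3, size=n_rows) % 10_000
    # Roughly uniform column for contrast.
    filler = rng.integers(0, 1_000_000, size=n_rows)

    with open("skewed_table.copy", "w") as f:
        for i in range(n_rows):
            f.write(f"{i}\t{category[i]}\t{filler[i]}\n")

    # Load with COPY, run the workload, then compare pg_stats
    # (most_common_vals / most_common_freqs) against the true frequencies.

That, plus tables of radically different sizes, would get us partway there.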

--Josh Berkus

