On Thu, 11 Sep 2003, Christopher Browne wrote:
> [EMAIL PROTECTED] ("scott.marlowe") writes:
> > On Thu, 11 Sep 2003, Tom Lane wrote:
> >> Christopher Browne <[EMAIL PROTECTED]> writes:
> >> > The "right answer" for most use seems likely to involve:
> >> > a) Getting an appropriate number of bins (I suspect 10 is a bit
> >> > small, but I can't justify that mathematically), and
> >> I suspect that also, but I don't have real evidence for it either.
> >> We've heard complaints from a number of people for whom it was indeed
> >> too small ... but that doesn't prove it's not appropriate in the
> >> majority of cases ...
> >> > Does the sample size change if you increase the number of bins?
> >> Yes, read the comments in backend/commands/analyze.c.
> >> > Do we also need a parameter to control sample size?
> >> Not if the paper I read before writing that code is correct.
> > I was just talking to a friend of mine who does statistical analysis, and
> > he suggested a different way of looking at this. I know little of the
> > analyze.c, but I'll be reading it some today.
> > His theory was that we can figure out the number of target bins by
> > basically running analyze twice with two different random seeds, and
> > initially setting the bins to 10.
> > The, compare the variance of the two runs. If the variance is great,
> > increase the target by X, and run two again. repeat, wash, rinse, until
> > the variance drops below some threshold.
> > I like the idea, I'm not at all sure if it's practical for Postgresql to
> > implement it.
> It may suffice to do some analytic runs on some "reasonable datasets"
> in order to come up with a better default than 10.
> If you run this process a few times on some different databases and
> find that the variance keeps dropping pretty quickly, then that would
> be good material for arguing that 10 should change to 17 or 23 or 31
> or some such value. (The only interesting pttern in that is that
> those are all primes :-).)
That's a good intermediate solution, but it really doesn't solve
everyone's issue. If one table/field has a nice even distribution (i.e.
10 rows with id 1, 10 rows with id2, so on and so on) then it won't need
nearly as high of a default target as a row with lots of weird spikes and
such in it.
That's why Joe (my statistics friend) made the point about iterating over
each table with higher targets until the variance drops to something
I would imagine a simple script would be a good proof of concept of this,
but in the long run, it would be a huge win if the analyze.c code did this
automagically eventually, so that you don't have a target that's still too
low for some complex data sets and too high for simple ones.
Well, time for me to get to work on a proof of concept...
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]