On Mon, Aug 6, 2012 at 6:09 PM, Heikki Linnakangas < heikki.linnakan...@enterprisedb.com> wrote:
> On 04.08.2012 12:31, Alexander Korotkov wrote: > >> Hackers, >> >> attached patch is for collecting statistics and selectivity estimation for >> ranges. >> >> In order to make our estimations accurate for every distribution of >> ranges, we would collect 2d-distribution of lower and upper bounds of >> range >> into some kind of 2d-histogram. However, this patch use some >> simplification >> and assume distribution of lower bound and distribution of length to be >> independent. >> > > Sounds reasonable. Another possibility would be to calculate the average > length for each lower-bound bin. So you would e.g know the average length > of values with lower bound between 1-10, and the average length of values > with lower bound between 10-20, and so forth. Within a bin, you would have > to assume that the distribution of the lengths is fixed. > Interesting idea. AFAICS, if we store average length for each lower-bound bin, we still have to assume some kind of distribution of range length in order to do estimates. For example, assume that range length have exponential distribution. Correspondingly, we've following trade off: we don't have to assume lower bound distribution to be independent from length distribution, but we have to assume kind of length distribution. Actually, I don't know what is better. Ideally, we would have range length histogram for each lower-bound bin, or upper-bound histogram for each lower-bound bin. But, storing such amount of data seems too expensive. ------ With best regards, Alexander Korotkov.