On Wed, Jun 6, 2018 at 7:18 PM Claudio Freire <klaussfre...@gmail.com> wrote:
> > > Comparison cost can be approximated probabilistically as:
> > >
> > > cost_comp = sum(cost_op_n * (1.0 / ndistinct(col_1_to_n)))
> > >
> > > Where cost_op_n is the cost of the comparison function for column N,
> > > and ndistinct(col_1_to_n) is an estimation of the number of distinct
> > > values for columns prior to N in the sort order.
> > >
> > > You can approximate ndistinct as the product of ndistinct of previous
> > > columns, or use extended statistics when available.
> > >
> >
> > Sure. The trouble is this also assumes uniform distribution, which is
> > not always the case.
>
> Well, (1.0 / ndistinct) = p(left == right).
>
> If you can compute a better p(left == right) with an MCV, you can get
> a better estimate. If you have an MCV. But that'd be a bit expensive,
> and I'm not sure it's all that relevant.
>
> To what degree does improving this produce better plans? As long as
> average expected cost is relatively well estimated (as in, one sort
> order relative to another sort order), it won't affect the decision.
>
> But if needed, the MCV can be used for this.
>
> So, in essence, you need to account for:
>
> - restrictions on that column that constrain the domain
> - distribution skew (MCV, nulls)
> - ndistinct
>
> To compute p(left == right).
>
> Maybe something as simple as the following?
>
> p_special = sum(mcv_i * mcv_i) + null_frac * null_frac
> p_other = (1 - p_special) * (1 - p_special) / ndistinct(restr)
Well, that came out with a slew of math errors, but the idea is clear: compute p(left == right) from the available statistics, constrained by any applicable restrictions.
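For what it's worth, here is a minimal sketch of the corrected computation (in Python, purely illustrative; none of these names are actual planner symbols). It assumes the per-column statistics (MCV frequencies, null_frac, ndistinct) have already been filtered by whatever restrictions apply, and it fixes the errors above: the residual probability mass should be the unsquared fraction not covered by the MCV list or NULLs, and the residual ndistinct should exclude the MCV entries.

# Illustrative sketch only, not planner code. Fixes the math errors in
# the formulas quoted above.

def p_equal(mcv_freqs, null_frac, ndistinct):
    """Estimate p(left == right) for one column from its statistics.

    mcv_freqs -- frequencies of the most common values (restricted to
                 values that survive any applicable restrictions)
    null_frac -- fraction of NULLs (NULLs tie with each other in a sort)
    ndistinct -- estimated number of distinct non-null values
    """
    # Two random rows both take MCV value v with probability freq(v)^2,
    # and are both NULL with probability null_frac^2.
    p_special = sum(f * f for f in mcv_freqs) + null_frac * null_frac

    # Residual mass: the *unsquared* fraction not covered by the MCV
    # list or NULLs, assumed uniform over the remaining distinct values.
    rest_frac = max(0.0, 1.0 - sum(mcv_freqs) - null_frac)
    rest_distinct = max(ndistinct - len(mcv_freqs), 1)
    return p_special + rest_frac * rest_frac / rest_distinct


def sort_comparison_cost(cost_op, p_eq):
    """Expected cost of comparing two random rows on a multi-column key.

    cost_op -- per-column comparison function costs, in sort-key order
    p_eq    -- per-column p(left == right), e.g. from p_equal()
    """
    total, p_reach = 0.0, 1.0
    for cost, p in zip(cost_op, p_eq):
        # Column N is only compared when all earlier columns tied.
        total += p_reach * cost
        p_reach *= p
    return total

For a two-column key this gives cost_op_1 + p_eq_1 * cost_op_2, which reduces to the quoted cost_comp formula when p(left == right) is taken to be 1/ndistinct of the preceding columns.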