On Thu, Jul 26, 2018 at 10:30 PM, Dean Rasheed <[email protected]> wrote:
> On 26 July 2018 at 07:12, Ashutosh Bapat
> <[email protected]> wrote:
>> In the patch clauselist_selectivity() gets called repeatedly for every
>> new qual added to the clause list. Instead, if we try to combine the
>> cost/row estimation with order_qual_clauses() or
>> clauselist_selectivity(), we might be able to what the current patch
>> does in O(n). clauselist_selectivity() calculates combined selectivity
>> of all the given quals in O(n). We should do something similar to that
>> in this case.
>
> Duplicating the logic of clauselist_selectivity() seems like quite a
> lot of effort to go to just for improved filter cost estimates. Note
> also that clauselist_selectivity() might get a good deal more complex
> with multivariate stats.
I am not suggesting to duplicate code in clauselist_selectivity().
Instead I am suggesting that we incorporate costing in that function.
I don't know how feasible is that.
>
> Perhaps a reasonable simplification would be to just treat the clauses
> as independent when estimating the filter cost, and so use the
> following as a "good enough" estimate for the filter cost:
>
> cost(A) + sel(A) * cost(B) + sel(A) * sel(B) * cost(C) + ...
>
> That would probably be an improvement over what we currently have. It
> would be O(n) to compute, and it would probably use the cached
> selectivity estimates for the clauses.
That looks like a good trade-off. But looking at the following comment
in clause_selectivity(), I doubt if this will work in all the cases
/*
* If possible, cache the result of the selectivity calculation for
* the clause. We can cache if varRelid is zero or the clause
* contains only vars of that relid --- otherwise varRelid will affect
* the result, so mustn't cache. Outer join quals might be examined
* with either their join's actual jointype or JOIN_INNER, so we need
* two cache variables to remember both cases. Note: we assume the
* result won't change if we are switching the input relations or
* considering a unique-ified case, so we only need one cache variable
* for all non-JOIN_INNER cases.
*/
>
> Note also that with this simplified expression for the filter cost, it
> would be possible to improve order_qual_clauses() to take into account
> the clause selectivities when sorting the clauses, and minimise the
> cost of the filter by evaluating more selective clauses sooner, if
> they're not too expensive. I believe that amounts to sorting by
> cost/(1-sel) rather than just cost, except for clauses with sel==1,
> which it makes sense to move to the end, and just sort by cost.
+1, if we could do that.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company