Re: joins and low selectivity optimization

Julian Hyde Tue, 28 Aug 2018 19:24:22 -0700

Sure, Calcite makes use of stats in its cost formulas. And you are correct that 
“metadata” is what Calcite calls statistics.


But you have to be careful to only treat statistics as approximate. If the 
statistics were gathered using an “ANALYZE TABLE” command a month ago they may 
be out of date, so you cannot use them to, say, remove “WHERE x < 10” if a 
month ago x only had values 2, 4, and 6.

> On Aug 28, 2018, at 7:14 PM, Andrei Sereda <[email protected]> wrote:
> 
> Thank you, Michael and Julian, for your answers.
> 
> Even if optimizers don't have access to data can they have access to table
> statistics ? If I remember correctly Oracle CBO is estimating selectivity
> based on column distribution (histograms) and some formulas for density
> <https://gerardnico.com/db/oracle/statistics/density>. I realize these
> statistics are not available for all data stores but can calcite optimizer
> be "smarter" when this data is available ?
> 
> On Tue, Aug 28, 2018 at 9:46 PM Julian Hyde <[email protected]> wrote:
> 
>> If I recall correctly, Hive does this kind of optimization. It’s pretty
>> important you have a date dimension table and your fact table is
>> partitioned on date. Example:
>> 
>>  select *
>>  from sales
>>    join date_dim on sales.date_id = date_dim.id
>>  where sales.product_name = ‘foo'
>>  and date_dim.quarter = ‘2018-Q2'
>> 
>> Hive would like to transform it to
>> 
>>  select *
>>  from sales
>>  where date_id in (20180401, 20180402, … , 20180630)
>>  and sales.product_name = ‘foo'
>> 
>> by pre-evaluating the query on the date_dim table. It doesn’t do the
>> optimization at logical planning time (where Calcite is involved) but at
>> physical planning time (which occurs later). The list of date_id values
>> allows it to scan a much more limited set of partitions of the sales fact
>> table.
>> 
>> Michael is correct that optimizers don’t usually have access to data. But
>> if the date_dim table changes only slowly, you could set up a “tripwire”
>> that will invalidate the plan if the date_dim table happens to change
>> between planning and execution.
>> 
>> Julian
>> 
>> 
>> 
>> 
>>> On Aug 28, 2018, at 6:04 PM, Michael Mior <[email protected]> wrote:
>>> 
>>> As far as I am aware, the optimizer has no access to data, only metadata.
>>> The traditional way to solve such problems would be to select among
>>> different join algorithms which perform better for varying cardinalities
>> of
>>> each side of the join. Unfortunately, I think you're likely to have a
>> tough
>>> time extracting the necessary data to do the rewrite you're aiming for.
>>> 
>>> --
>>> Michael Mior
>>> [email protected]
>>> 
>>> 
>>> 
>>> Le mar. 28 août 2018 à 20:34, Andrei Sereda <[email protected]> a écrit :
>>> 
>>>> Hello,
>>>> 
>>>> I’m looking for a way to improve performance of a join query.
>>>> 
>>>> Suppose one joins two heterogeneous sources t1 and t2 with some
>> predicates.
>>>> 
>>>> Further assume that cardinality of one of the predicates is very low
>>>> (compared cardinality of the second one). (How) Is it possible to
>> convert
>>>> second query (predicate) to include results (primary keys) of the first
>> one
>>>> (with low selectivity) ?
>>>> Example
>>>> 
>>>> select *from
>>>> t1 left join t1 on (t1.id = t2.id)where
>>>> t1.attr = 'foo' and t2.attr = 'bar'
>>>> 
>>>> Let’s say that predicate t1.attr = 'foo' results in 3 rows (id=1, 2, 3).
>>>> Will it be possible to rewrite t2 query to :
>>>> 
>>>> select *from t2 where
>>>>  id in (1, 2, 3) and t2.attr = 'bar'
>>>> 
>>>> I’m aware of existence of Metadata
>>>> <
>>>> 
>> https://calcite.apache.org/apidocs/org/apache/calcite/rel/metadata/Metadata.html
>>>>> 
>>>> but not sure to use it.
>>>> 
>>>> Any hits / directions are appreciated.
>>>> 
>>>> Thanks,
>>>> Andrei.
>>>> 
>> 
>>

Re: joins and low selectivity optimization

Reply via email to