Re: joins and low selectivity optimization

Andrei Sereda Tue, 28 Aug 2018 19:15:01 -0700

Thank you, Michael and Julian, for your answers.

Even if optimizers don't have access to data can they have access to table
statistics ? If I remember correctly Oracle CBO is estimating selectivity
based on column distribution (histograms) and some formulas for density
<https://gerardnico.com/db/oracle/statistics/density>. I realize these
statistics are not available for all data stores but can calcite optimizer
be "smarter" when this data is available ?


On Tue, Aug 28, 2018 at 9:46 PM Julian Hyde <[email protected]> wrote:

> If I recall correctly, Hive does this kind of optimization. It’s pretty
> important you have a date dimension table and your fact table is
> partitioned on date. Example:
>
>   select *
>   from sales
>     join date_dim on sales.date_id = date_dim.id
>   where sales.product_name = ‘foo'
>   and date_dim.quarter = ‘2018-Q2'
>
> Hive would like to transform it to
>
>   select *
>   from sales
>   where date_id in (20180401, 20180402, … , 20180630)
>   and sales.product_name = ‘foo'
>
> by pre-evaluating the query on the date_dim table. It doesn’t do the
> optimization at logical planning time (where Calcite is involved) but at
> physical planning time (which occurs later). The list of date_id values
> allows it to scan a much more limited set of partitions of the sales fact
> table.
>
> Michael is correct that optimizers don’t usually have access to data. But
> if the date_dim table changes only slowly, you could set up a “tripwire”
> that will invalidate the plan if the date_dim table happens to change
> between planning and execution.
>
> Julian
>
>
>
>
> > On Aug 28, 2018, at 6:04 PM, Michael Mior <[email protected]> wrote:
> >
> > As far as I am aware, the optimizer has no access to data, only metadata.
> > The traditional way to solve such problems would be to select among
> > different join algorithms which perform better for varying cardinalities
> of
> > each side of the join. Unfortunately, I think you're likely to have a
> tough
> > time extracting the necessary data to do the rewrite you're aiming for.
> >
> > --
> > Michael Mior
> > [email protected]
> >
> >
> >
> > Le mar. 28 août 2018 à 20:34, Andrei Sereda <[email protected]> a écrit :
> >
> >> Hello,
> >>
> >> I’m looking for a way to improve performance of a join query.
> >>
> >> Suppose one joins two heterogeneous sources t1 and t2 with some
> predicates.
> >>
> >> Further assume that cardinality of one of the predicates is very low
> >> (compared cardinality of the second one). (How) Is it possible to
> convert
> >> second query (predicate) to include results (primary keys) of the first
> one
> >> (with low selectivity) ?
> >> Example
> >>
> >> select *from
> >>  t1 left join t1 on (t1.id = t2.id)where
> >>  t1.attr = 'foo' and t2.attr = 'bar'
> >>
> >> Let’s say that predicate t1.attr = 'foo' results in 3 rows (id=1, 2, 3).
> >> Will it be possible to rewrite t2 query to :
> >>
> >> select *from t2 where
> >>   id in (1, 2, 3) and t2.attr = 'bar'
> >>
> >> I’m aware of existence of Metadata
> >> <
> >>
> https://calcite.apache.org/apidocs/org/apache/calcite/rel/metadata/Metadata.html
> >>>
> >> but not sure to use it.
> >>
> >> Any hits / directions are appreciated.
> >>
> >> Thanks,
> >> Andrei.
> >>
>
>

Re: joins and low selectivity optimization

Reply via email to