Re: joins and low selectivity optimization

Andrei Sereda Wed, 29 Aug 2018 07:49:53 -0700

A follow-up question. I presume this is applicable equally to joins without
predicates ?


select *from
  t1 join t2 on (t1.id = t2.id)

Even if t1 has a much lower cardinality than t2 (say many orders of
magnitude), Calcite would not know to convert t2 query to:

select * from t2 where id in (1, 2, 3) -- many ids

“Manual” optimization

Even the following might get “de-optimized” ?

select *from
  t1 join t2 on (t1.id = t2.id)where
  t2.id in (select id from t1) -- force sub select


On Tue, Aug 28, 2018 at 10:24 PM Julian Hyde <[email protected]> wrote:

> Sure, Calcite makes use of stats in its cost formulas. And you are correct
> that “metadata” is what Calcite calls statistics.
>
> But you have to be careful to only treat statistics as approximate. If the
> statistics were gathered using an “ANALYZE TABLE” command a month ago they
> may be out of date, so you cannot use them to, say, remove “WHERE x < 10”
> if a month ago x only had values 2, 4, and 6.
>
> > On Aug 28, 2018, at 7:14 PM, Andrei Sereda <[email protected]> wrote:
> >
> > Thank you, Michael and Julian, for your answers.
> >
> > Even if optimizers don't have access to data can they have access to
> table
> > statistics ? If I remember correctly Oracle CBO is estimating selectivity
> > based on column distribution (histograms) and some formulas for density
> > <https://gerardnico.com/db/oracle/statistics/density>. I realize these
> > statistics are not available for all data stores but can calcite
> optimizer
> > be "smarter" when this data is available ?
> >
> > On Tue, Aug 28, 2018 at 9:46 PM Julian Hyde <[email protected]> wrote:
> >
> >> If I recall correctly, Hive does this kind of optimization. It’s pretty
> >> important you have a date dimension table and your fact table is
> >> partitioned on date. Example:
> >>
> >>  select *
> >>  from sales
> >>    join date_dim on sales.date_id = date_dim.id
> >>  where sales.product_name = ‘foo'
> >>  and date_dim.quarter = ‘2018-Q2'
> >>
> >> Hive would like to transform it to
> >>
> >>  select *
> >>  from sales
> >>  where date_id in (20180401, 20180402, … , 20180630)
> >>  and sales.product_name = ‘foo'
> >>
> >> by pre-evaluating the query on the date_dim table. It doesn’t do the
> >> optimization at logical planning time (where Calcite is involved) but at
> >> physical planning time (which occurs later). The list of date_id values
> >> allows it to scan a much more limited set of partitions of the sales
> fact
> >> table.
> >>
> >> Michael is correct that optimizers don’t usually have access to data.
> But
> >> if the date_dim table changes only slowly, you could set up a “tripwire”
> >> that will invalidate the plan if the date_dim table happens to change
> >> between planning and execution.
> >>
> >> Julian
> >>
> >>
> >>
> >>
> >>> On Aug 28, 2018, at 6:04 PM, Michael Mior <[email protected]> wrote:
> >>>
> >>> As far as I am aware, the optimizer has no access to data, only
> metadata.
> >>> The traditional way to solve such problems would be to select among
> >>> different join algorithms which perform better for varying
> cardinalities
> >> of
> >>> each side of the join. Unfortunately, I think you're likely to have a
> >> tough
> >>> time extracting the necessary data to do the rewrite you're aiming for.
> >>>
> >>> --
> >>> Michael Mior
> >>> [email protected]
> >>>
> >>>
> >>>
> >>> Le mar. 28 août 2018 à 20:34, Andrei Sereda <[email protected]> a
> écrit :
> >>>
> >>>> Hello,
> >>>>
> >>>> I’m looking for a way to improve performance of a join query.
> >>>>
> >>>> Suppose one joins two heterogeneous sources t1 and t2 with some
> >> predicates.
> >>>>
> >>>> Further assume that cardinality of one of the predicates is very low
> >>>> (compared cardinality of the second one). (How) Is it possible to
> >> convert
> >>>> second query (predicate) to include results (primary keys) of the
> first
> >> one
> >>>> (with low selectivity) ?
> >>>> Example
> >>>>
> >>>> select *from
> >>>> t1 left join t1 on (t1.id = t2.id)where
> >>>> t1.attr = 'foo' and t2.attr = 'bar'
> >>>>
> >>>> Let’s say that predicate t1.attr = 'foo' results in 3 rows (id=1, 2,
> 3).
> >>>> Will it be possible to rewrite t2 query to :
> >>>>
> >>>> select *from t2 where
> >>>>  id in (1, 2, 3) and t2.attr = 'bar'
> >>>>
> >>>> I’m aware of existence of Metadata
> >>>> <
> >>>>
> >>
> https://calcite.apache.org/apidocs/org/apache/calcite/rel/metadata/Metadata.html
> >>>>>
> >>>> but not sure to use it.
> >>>>
> >>>> Any hits / directions are appreciated.
> >>>>
> >>>> Thanks,
> >>>> Andrei.
> >>>>
> >>
> >>
>
>

Re: joins and low selectivity optimization

Reply via email to