A follow-up question. I presume this is applicable equally to joins without predicates ?
select *from t1 join t2 on (t1.id = t2.id) Even if t1 has a much lower cardinality than t2 (say many orders of magnitude), Calcite would not know to convert t2 query to: select * from t2 where id in (1, 2, 3) -- many ids “Manual” optimization Even the following might get “de-optimized” ? select *from t1 join t2 on (t1.id = t2.id)where t2.id in (select id from t1) -- force sub select On Tue, Aug 28, 2018 at 10:24 PM Julian Hyde <[email protected]> wrote: > Sure, Calcite makes use of stats in its cost formulas. And you are correct > that “metadata” is what Calcite calls statistics. > > But you have to be careful to only treat statistics as approximate. If the > statistics were gathered using an “ANALYZE TABLE” command a month ago they > may be out of date, so you cannot use them to, say, remove “WHERE x < 10” > if a month ago x only had values 2, 4, and 6. > > > On Aug 28, 2018, at 7:14 PM, Andrei Sereda <[email protected]> wrote: > > > > Thank you, Michael and Julian, for your answers. > > > > Even if optimizers don't have access to data can they have access to > table > > statistics ? If I remember correctly Oracle CBO is estimating selectivity > > based on column distribution (histograms) and some formulas for density > > <https://gerardnico.com/db/oracle/statistics/density>. I realize these > > statistics are not available for all data stores but can calcite > optimizer > > be "smarter" when this data is available ? > > > > On Tue, Aug 28, 2018 at 9:46 PM Julian Hyde <[email protected]> wrote: > > > >> If I recall correctly, Hive does this kind of optimization. It’s pretty > >> important you have a date dimension table and your fact table is > >> partitioned on date. Example: > >> > >> select * > >> from sales > >> join date_dim on sales.date_id = date_dim.id > >> where sales.product_name = ‘foo' > >> and date_dim.quarter = ‘2018-Q2' > >> > >> Hive would like to transform it to > >> > >> select * > >> from sales > >> where date_id in (20180401, 20180402, … , 20180630) > >> and sales.product_name = ‘foo' > >> > >> by pre-evaluating the query on the date_dim table. It doesn’t do the > >> optimization at logical planning time (where Calcite is involved) but at > >> physical planning time (which occurs later). The list of date_id values > >> allows it to scan a much more limited set of partitions of the sales > fact > >> table. > >> > >> Michael is correct that optimizers don’t usually have access to data. > But > >> if the date_dim table changes only slowly, you could set up a “tripwire” > >> that will invalidate the plan if the date_dim table happens to change > >> between planning and execution. > >> > >> Julian > >> > >> > >> > >> > >>> On Aug 28, 2018, at 6:04 PM, Michael Mior <[email protected]> wrote: > >>> > >>> As far as I am aware, the optimizer has no access to data, only > metadata. > >>> The traditional way to solve such problems would be to select among > >>> different join algorithms which perform better for varying > cardinalities > >> of > >>> each side of the join. Unfortunately, I think you're likely to have a > >> tough > >>> time extracting the necessary data to do the rewrite you're aiming for. > >>> > >>> -- > >>> Michael Mior > >>> [email protected] > >>> > >>> > >>> > >>> Le mar. 28 août 2018 à 20:34, Andrei Sereda <[email protected]> a > écrit : > >>> > >>>> Hello, > >>>> > >>>> I’m looking for a way to improve performance of a join query. > >>>> > >>>> Suppose one joins two heterogeneous sources t1 and t2 with some > >> predicates. > >>>> > >>>> Further assume that cardinality of one of the predicates is very low > >>>> (compared cardinality of the second one). (How) Is it possible to > >> convert > >>>> second query (predicate) to include results (primary keys) of the > first > >> one > >>>> (with low selectivity) ? > >>>> Example > >>>> > >>>> select *from > >>>> t1 left join t1 on (t1.id = t2.id)where > >>>> t1.attr = 'foo' and t2.attr = 'bar' > >>>> > >>>> Let’s say that predicate t1.attr = 'foo' results in 3 rows (id=1, 2, > 3). > >>>> Will it be possible to rewrite t2 query to : > >>>> > >>>> select *from t2 where > >>>> id in (1, 2, 3) and t2.attr = 'bar' > >>>> > >>>> I’m aware of existence of Metadata > >>>> < > >>>> > >> > https://calcite.apache.org/apidocs/org/apache/calcite/rel/metadata/Metadata.html > >>>>> > >>>> but not sure to use it. > >>>> > >>>> Any hits / directions are appreciated. > >>>> > >>>> Thanks, > >>>> Andrei. > >>>> > >> > >> > >
