Thank you, Michael and Julian, for your answers. Even if optimizers don't have access to data can they have access to table statistics ? If I remember correctly Oracle CBO is estimating selectivity based on column distribution (histograms) and some formulas for density <https://gerardnico.com/db/oracle/statistics/density>. I realize these statistics are not available for all data stores but can calcite optimizer be "smarter" when this data is available ?
On Tue, Aug 28, 2018 at 9:46 PM Julian Hyde <[email protected]> wrote: > If I recall correctly, Hive does this kind of optimization. It’s pretty > important you have a date dimension table and your fact table is > partitioned on date. Example: > > select * > from sales > join date_dim on sales.date_id = date_dim.id > where sales.product_name = ‘foo' > and date_dim.quarter = ‘2018-Q2' > > Hive would like to transform it to > > select * > from sales > where date_id in (20180401, 20180402, … , 20180630) > and sales.product_name = ‘foo' > > by pre-evaluating the query on the date_dim table. It doesn’t do the > optimization at logical planning time (where Calcite is involved) but at > physical planning time (which occurs later). The list of date_id values > allows it to scan a much more limited set of partitions of the sales fact > table. > > Michael is correct that optimizers don’t usually have access to data. But > if the date_dim table changes only slowly, you could set up a “tripwire” > that will invalidate the plan if the date_dim table happens to change > between planning and execution. > > Julian > > > > > > On Aug 28, 2018, at 6:04 PM, Michael Mior <[email protected]> wrote: > > > > As far as I am aware, the optimizer has no access to data, only metadata. > > The traditional way to solve such problems would be to select among > > different join algorithms which perform better for varying cardinalities > of > > each side of the join. Unfortunately, I think you're likely to have a > tough > > time extracting the necessary data to do the rewrite you're aiming for. > > > > -- > > Michael Mior > > [email protected] > > > > > > > > Le mar. 28 août 2018 à 20:34, Andrei Sereda <[email protected]> a écrit : > > > >> Hello, > >> > >> I’m looking for a way to improve performance of a join query. > >> > >> Suppose one joins two heterogeneous sources t1 and t2 with some > predicates. > >> > >> Further assume that cardinality of one of the predicates is very low > >> (compared cardinality of the second one). (How) Is it possible to > convert > >> second query (predicate) to include results (primary keys) of the first > one > >> (with low selectivity) ? > >> Example > >> > >> select *from > >> t1 left join t1 on (t1.id = t2.id)where > >> t1.attr = 'foo' and t2.attr = 'bar' > >> > >> Let’s say that predicate t1.attr = 'foo' results in 3 rows (id=1, 2, 3). > >> Will it be possible to rewrite t2 query to : > >> > >> select *from t2 where > >> id in (1, 2, 3) and t2.attr = 'bar' > >> > >> I’m aware of existence of Metadata > >> < > >> > https://calcite.apache.org/apidocs/org/apache/calcite/rel/metadata/Metadata.html > >>> > >> but not sure to use it. > >> > >> Any hits / directions are appreciated. > >> > >> Thanks, > >> Andrei. > >> > >
