If I recall correctly, Hive does this kind of optimization. It’s pretty
important you have a date dimension table and your fact table is partitioned on
date. Example:
select *
from sales
join date_dim on sales.date_id = date_dim.id
where sales.product_name = ‘foo'
and date_dim.quarter = ‘2018-Q2'
Hive would like to transform it to
select *
from sales
where date_id in (20180401, 20180402, … , 20180630)
and sales.product_name = ‘foo'
by pre-evaluating the query on the date_dim table. It doesn’t do the
optimization at logical planning time (where Calcite is involved) but at
physical planning time (which occurs later). The list of date_id values allows
it to scan a much more limited set of partitions of the sales fact table.
Michael is correct that optimizers don’t usually have access to data. But if
the date_dim table changes only slowly, you could set up a “tripwire” that will
invalidate the plan if the date_dim table happens to change between planning
and execution.
Julian
> On Aug 28, 2018, at 6:04 PM, Michael Mior <[email protected]> wrote:
>
> As far as I am aware, the optimizer has no access to data, only metadata.
> The traditional way to solve such problems would be to select among
> different join algorithms which perform better for varying cardinalities of
> each side of the join. Unfortunately, I think you're likely to have a tough
> time extracting the necessary data to do the rewrite you're aiming for.
>
> --
> Michael Mior
> [email protected]
>
>
>
> Le mar. 28 août 2018 à 20:34, Andrei Sereda <[email protected]> a écrit :
>
>> Hello,
>>
>> I’m looking for a way to improve performance of a join query.
>>
>> Suppose one joins two heterogeneous sources t1 and t2 with some predicates.
>>
>> Further assume that cardinality of one of the predicates is very low
>> (compared cardinality of the second one). (How) Is it possible to convert
>> second query (predicate) to include results (primary keys) of the first one
>> (with low selectivity) ?
>> Example
>>
>> select *from
>> t1 left join t1 on (t1.id = t2.id)where
>> t1.attr = 'foo' and t2.attr = 'bar'
>>
>> Let’s say that predicate t1.attr = 'foo' results in 3 rows (id=1, 2, 3).
>> Will it be possible to rewrite t2 query to :
>>
>> select *from t2 where
>> id in (1, 2, 3) and t2.attr = 'bar'
>>
>> I’m aware of existence of Metadata
>> <
>> https://calcite.apache.org/apidocs/org/apache/calcite/rel/metadata/Metadata.html
>>>
>> but not sure to use it.
>>
>> Any hits / directions are appreciated.
>>
>> Thanks,
>> Andrei.
>>