Re: joins and low selectivity optimization

Julian Hyde Tue, 28 Aug 2018 18:46:08 -0700

If I recall correctly, Hive does this kind of optimization. It’s pretty 
important you have a date dimension table and your fact table is partitioned on 
date. Example:


  select *
  from sales
    join date_dim on sales.date_id = date_dim.id
  where sales.product_name = ‘foo'
  and date_dim.quarter = ‘2018-Q2'

Hive would like to transform it to

  select *
  from sales
  where date_id in (20180401, 20180402, … , 20180630)
  and sales.product_name = ‘foo'

by pre-evaluating the query on the date_dim table. It doesn’t do the 
optimization at logical planning time (where Calcite is involved) but at 
physical planning time (which occurs later). The list of date_id values allows 
it to scan a much more limited set of partitions of the sales fact table.

Michael is correct that optimizers don’t usually have access to data. But if 
the date_dim table changes only slowly, you could set up a “tripwire” that will 
invalidate the plan if the date_dim table happens to change between planning 
and execution.

Julian




> On Aug 28, 2018, at 6:04 PM, Michael Mior <[email protected]> wrote:
> 
> As far as I am aware, the optimizer has no access to data, only metadata.
> The traditional way to solve such problems would be to select among
> different join algorithms which perform better for varying cardinalities of
> each side of the join. Unfortunately, I think you're likely to have a tough
> time extracting the necessary data to do the rewrite you're aiming for.
> 
> --
> Michael Mior
> [email protected]
> 
> 
> 
> Le mar. 28 août 2018 à 20:34, Andrei Sereda <[email protected]> a écrit :
> 
>> Hello,
>> 
>> I’m looking for a way to improve performance of a join query.
>> 
>> Suppose one joins two heterogeneous sources t1 and t2 with some predicates.
>> 
>> Further assume that cardinality of one of the predicates is very low
>> (compared cardinality of the second one). (How) Is it possible to convert
>> second query (predicate) to include results (primary keys) of the first one
>> (with low selectivity) ?
>> Example
>> 
>> select *from
>>  t1 left join t1 on (t1.id = t2.id)where
>>  t1.attr = 'foo' and t2.attr = 'bar'
>> 
>> Let’s say that predicate t1.attr = 'foo' results in 3 rows (id=1, 2, 3).
>> Will it be possible to rewrite t2 query to :
>> 
>> select *from t2 where
>>   id in (1, 2, 3) and t2.attr = 'bar'
>> 
>> I’m aware of existence of Metadata
>> <
>> https://calcite.apache.org/apidocs/org/apache/calcite/rel/metadata/Metadata.html
>>> 
>> but not sure to use it.
>> 
>> Any hits / directions are appreciated.
>> 
>> Thanks,
>> Andrei.
>>

Re: joins and low selectivity optimization

Reply via email to