Re: joins and low selectivity optimization

Andrei Sereda Tue, 19 Feb 2019 16:32:24 -0800

Hello,

I would like to resurrect this thread in the context of Calcite streams.
Unfortunately, bloom filters are not an option for the data sources being
used.

Say one has a stream-to-table join
<https://calcite.apache.org/docs/stream.html#joining-streams-to-tables>.
From the docs example:

SELECT STREAM
    o.productId, o.orderId, o.units, p.name, p.unitPrice
FROM Orders AS o        -- streamable table
JOIN Products AS p      -- reference data table
    ON o.productId = p.productId;

   1. Am I correct to assume that each event in the Orders table (which is
   a stream) will trigger a full table scan (without a filter) on the
   Products table?
   2. Can I register a custom rule that rewrites the query when, say, the
   Orders and Products tables are both present, to manually add a sub-query?
   3. Do I have to disable SubQueryRemoveRule in this case?
   4. Vadym, I am not sure how the sub-query computation would work. Can I
   partially execute the query and convert the sub-query into
   EnumerableValues?

Is there a way to solve this problem non-generically?
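
To make the question concrete, here is a minimal sketch of the kind of
non-generic workaround I have in mind. It is not Calcite code and uses no
Calcite API: SQLite stands in for the Products data source, the function
name join_batch is hypothetical, and processing Orders events in small
batches is my own assumption. The idea is to precompute the distinct join
keys from the stream side, push them to the table as an IN-list filter
instead of issuing an unfiltered scan, and finish the join locally.

```python
import sqlite3


def join_batch(conn, orders):
    """Join a batch of order events against Products, fetching only the
    product rows whose keys actually occur in the batch."""
    # Step 1: precompute the distinct join keys seen in this batch.
    keys = sorted({o["productId"] for o in orders})
    if not keys:
        return []
    # Step 2: send the keys to the table as an IN-list filter (the sub-query
    # materialised as literal values), so no unfiltered scan is issued.
    placeholders = ", ".join("?" for _ in keys)
    rows = conn.execute(
        "SELECT productId, name, unitPrice FROM Products "
        f"WHERE productId IN ({placeholders})",
        keys,
    ).fetchall()
    products = {pid: (name, price) for pid, name, price in rows}
    # Step 3: finish the join locally with a hash lookup (inner join).
    return [
        (o["productId"], o["orderId"], o["units"], *products[o["productId"]])
        for o in orders
        if o["productId"] in products
    ]


# Stand-in reference table (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products "
    "(productId INTEGER PRIMARY KEY, name TEXT, unitPrice REAL)"
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [(1, "paint", 10.0), (2, "brush", 3.5), (3, "tape", 1.25)],
)

batch = [
    {"productId": 2, "orderId": 100, "units": 4},
    {"productId": 3, "orderId": 101, "units": 1},
    {"productId": 9, "orderId": 102, "units": 7},  # no matching product
]
result = join_batch(conn, batch)
```

With a batch size of one this degenerates into a keyed lookup per event,
which is essentially the nested-loop shape discussed earlier in the thread.
Larger batches trade latency for fewer round trips, and very large key sets
would need to be split to stay within the source's IN-list limits.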

We’re also hitting this limitation in Flink (which uses Calcite but not
Calcite streams) for a similar use case.

Many Thanks,
Andrei.

On Thu, Aug 30, 2018 at 5:27 PM Vineet Garg <[email protected]> wrote:

> Hive actually does this optimization (it is called semi-join reduction) by
> generating bloom filters on one side and passing them on to the other side.
> This is not a rewrite but instead a physical implementation.
>
> Vineet
>
> On Aug 29, 2018, at 10:34 AM, Vladimir Sitnikov <[email protected]> wrote:
>
> Nested loops are never likely to happe
>
> What's wrong with that?
>
> Apparently Andrei asks for that, and "subquery precomputation" is quite
> close to nested loops in my opinion.
>
> Vladimir
>
>
