Hi Vladimir,

Thanks for the follow-up and explanation. I wanted to make sure I'm not missing (or misunderstanding) anything.
Andrei.

On Wed, Aug 29, 2018 at 11:01 AM Vladimir Sitnikov <[email protected]> wrote:

> One of the approaches to such queries is to throw Bloom filters all over
> the place.
>
> That is it could execute "small side" of the join, collect the ids (or a
> lossy version of it in a form of Bloom filters),
> and it could propagate that Bloom filter to the second source to reduce the
> set of rows produced by the second row source.
> Then the join would be easier to do since the second row source is reduced.
>
> The sad thing is not all systems support propagation of bloom filters.
>
> > select * from
> >   t1 join t2 on (t1.id = t2.id)
> > where
> >   t2.id in (select id from t1) -- force sub select
>
> What if Calcite did just a regular batched nested loop join?
> That is:
> 1. Fetch next 10 rows from t1
> 2. Fetch "from t2 where id in (...)"
> 3. goto 1
>
> It can be expressed via correlated subqueries, however:
> a) I'm not sure correlated subqueries work great at the moment
> b) Support for "batched" correlated execution is likely not there
> c) Calcite should somehow know the true cost of "from t2 where id in (1,2)"
> vs "from t2 where id in (1,2,3,4)". In other words, current costing model
> does not take into account if the table has index or not. One can code such
> costing rules, however I think it is not there yet.
>
> Vladimir
>
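For reference, the three-step batched nested loop join described in the quoted message can be sketched roughly as follows. This is only an illustration in Python against an in-memory SQLite database, not how Calcite implements or plans anything. The table layouts, the index on t2.id, and the helper names make_demo_db and batched_nested_loop_join are all assumptions made up for the example.

```python
import sqlite3
from collections import defaultdict


def make_demo_db():
    """Build an in-memory database with hypothetical tables t1 and t2."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE t2 (id INTEGER, val TEXT)")
    # Assumption: t2 is indexed on id, so each IN (...) probe is cheap.
    conn.execute("CREATE INDEX t2_id ON t2 (id)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?)",
                     [(i, f"n{i}") for i in range(1, 26)])
    conn.executemany("INSERT INTO t2 VALUES (?, ?)",
                     [(i, f"v{i}") for i in range(2, 51, 2)])
    return conn


def batched_nested_loop_join(conn, batch_size=10):
    """Yield (t1.id, t1.name, t2.val) rows, probing t2 once per t1 batch."""
    outer = conn.execute("SELECT id, name FROM t1 ORDER BY id")
    while True:
        # Step 1: fetch the next batch of rows from t1.
        batch = outer.fetchmany(batch_size)
        if not batch:
            break
        # Step 2: fetch "from t2 where id in (...)" for the ids in this batch.
        ids = [row[0] for row in batch]
        placeholders = ",".join("?" for _ in ids)
        inner = conn.execute(
            f"SELECT id, val FROM t2 WHERE id IN ({placeholders})", ids)
        matches = defaultdict(list)
        for t2_id, val in inner:
            matches[t2_id].append(val)
        # Combine each t1 row with its matching t2 rows.
        for t1_id, name in batch:
            for val in matches.get(t1_id, []):
                yield (t1_id, name, val)
        # Step 3: loop back to step 1 for the next batch.


conn = make_demo_db()
for row in batched_nested_loop_join(conn, batch_size=10):
    print(row)
```

The batch size trades the number of round trips to the second source against the length of the IN list, which is the same cost question raised in point c) of the quoted message.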
