Hi, I was wondering about the pros and cons of having a Calcite rule which could rewrite a join to utilize bloom filters; something like:
select e.*
from emp e
join dept d on(e.deptno=d.deptno);
where d.dname='Sales';
into something like:
select e.*
from (
select e.* from emp e join (
select bloom_sketch(deptno) as sketch from dept
dname='Sales'
) dept_agg on (bloom_contains(sketch,e.deptno)
) e
join dept d on(e.deptno=d.deptno)
where d.dname='Sales';
Generally for the original query:
* if "dept" is very small a mapjoin is used which is great
* or possibly some nested loops with index usages on the big table
* but if the execution engine decides to use a non-specialized approach like merge-join or hash-join; it may move around a lot of data - and in those cases this might be
usefull
There are systems which handle this by introducing a bloom filter (Hive; Spark) and transfer that in the background for the big-table readers - but that's outside the scope of the planner. I was wondering if it would be beneficial or not to introduce such a rule - so that using this can be a cost-based decision during planning.
pro: * to enable an engine to support this optimization - it would only need to implement a few UDFs * the rule could put the use of this optimization under cost-based decision con: * an extra scan of the small table * it adds an extra join + aggregate computation * exec engine will most likely exploit that its just a single row * I guess without proper stats this could even worsen things * it could put more stress on (join) planning - as it could introduce more joins What do you guys think? cheers, Zoltan
OpenPGP_signature
Description: OpenPGP digital signature
