Hi,

I was wondering about the pros and cons of having a Calcite rule which could 
rewrite a join to utilize bloom filters; something like:

select e.*
        from emp e
        join dept d on(e.deptno=d.deptno);
        where d.dname='Sales';

into something like:

select e.*
        from (
                select e.* from emp e join (
                                select bloom_sketch(deptno) as sketch from dept 
dname='Sales'
                        ) dept_agg on (bloom_contains(sketch,e.deptno)
        ) e
        join dept d on(e.deptno=d.deptno)
        where d.dname='Sales';

Generally for the original query:
* if "dept" is very small a mapjoin is used which is great
* or possibly some nested loops with index usages on the big table
* but if the execution engine decides to use a non-specialized approach like merge-join or hash-join; it may move around a lot of data - and in those cases this might be usefull

There are systems which handle this by introducing a bloom filter (Hive; Spark) and transfer that in the background for the big-table readers - but that's outside the scope of the planner. I was wondering if it would be beneficial or not to introduce such a rule - so that using this can be a cost-based decision during planning.

pro:
* to enable an engine to support this optimization - it would only need to 
implement a few UDFs
* the rule could put the use of this optimization under cost-based decision

con:
* an extra scan of the small table
* it adds an extra join + aggregate computation
  * exec engine will most likely exploit that its just a single row
* I guess without proper stats this could even worsen things
* it could put more stress on (join) planning - as it could introduce more joins

What do you guys think?

cheers,
Zoltan


Attachment: OpenPGP_signature
Description: OpenPGP digital signature

Reply via email to