It would be great to have such a rule. People who don’t want it can disable it; and people who enable it can use a cost function.
Some systems that use Bloom filters (and other probabilistic filters) don’t execute the query twice but use a side-channel to send the Bloom filter from one scan to the other. For example, suppose that the “dept" table is smaller and its scan finishes faster. When the scan has finished, it sends the Bloom filter to the “emp" scan, which is still under way. From that point, the “emp” scan can eliminate a fraction of its rows because it knows that their “deptno” values do not pass the filter. Julian > On Apr 28, 2023, at 9:01 AM, Zoltan Haindrich <[email protected]> wrote: > > Hi, > > I was wondering about the pros and cons of having a Calcite rule which could > rewrite a join to utilize bloom filters; something like: > > select e.* > from emp e > join dept d on(e.deptno=d.deptno); > where d.dname='Sales'; > > into something like: > > select e.* > from ( > select e.* from emp e join ( > select bloom_sketch(deptno) as sketch from dept > dname='Sales' > ) dept_agg on (bloom_contains(sketch,e.deptno) > ) e > join dept d on(e.deptno=d.deptno) > where d.dname='Sales'; > > Generally for the original query: > * if "dept" is very small a mapjoin is used which is great > * or possibly some nested loops with index usages on the big table > * but if the execution engine decides to use a non-specialized approach like > merge-join or hash-join; it may move around a lot of data - and in those > cases this might be usefull > > There are systems which handle this by introducing a bloom filter (Hive; > Spark) and transfer that in the background for the big-table readers - but > that's outside the scope of the planner. I was wondering if it would be > beneficial or not to introduce such a rule - so that using this can be a > cost-based decision during planning. > > pro: > * to enable an engine to support this optimization - it would only need to > implement a few UDFs > * the rule could put the use of this optimization under cost-based decision > > con: > * an extra scan of the small table > * it adds an extra join + aggregate computation > * exec engine will most likely exploit that its just a single row > * I guess without proper stats this could even worsen things > * it could put more stress on (join) planning - as it could introduce more > joins > > What do you guys think? > > cheers, > Zoltan > >
