Hi!

I’m working on a new data format using Apache Drill as query engine. It has 
very nice scan speed, high compression ratio, and contains indexes to filter 
out irrelevant parts. Even more interesting is it supports extremely fast 
messages realtime ingestion. It is not yet open sourced but soon.

But right now we run into a issue and need your help. We know Drill can push 
down predicates to GroupScan. But it may transform the large IN condition into 
an HashJoin to improve performance. Like this:

0: jdbc:drill:drillbit=localhost> explain plan for select user_id from test 
where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      UnionExchange
01-01        Project(user_id=[$0])
01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
01-04            Project($f15=[$0], $f69=[$0])
01-05              
Scan(groupscan=[TestGroupScan@4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, 
rsFilter:null}, columns=[`user_id`]}])
01-03            BroadcastExchange
02-01              HashAgg(group=[{0}])
02-02                Values


The problem is we need to know there is a IN condition, either by predicates 
push down or other ways, so that we can do the optimization job. 

Drill has an option “planner.in_subquery_threshold” can prevent the planner 
from transforming the IN condition, but we would like to keep that untouched 
because a large  “or (equals … )” can also slow down things.

So our question is, if we keep the “in_subquery” planner optimization, is there 
any ways we can be notified about the IN condition?

Looking forward to your reply!

Regards
Flow Wei



Reply via email to