Hi!
I’m working on a new data format using Apache Drill as query engine. It has
very nice scan speed, high compression ratio, and contains indexes to filter
out irrelevant parts. Even more interesting is it supports extremely fast
messages realtime ingestion. It is not yet open sourced but soon.
But right now we run into a issue and need your help. We know Drill can push
down predicates to GroupScan. But it may transform the large IN condition into
an HashJoin to improve performance. Like this:
0: jdbc:drill:drillbit=localhost> explain plan for select user_id from test
where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 UnionExchange
01-01 Project(user_id=[$0])
01-02 HashJoin(condition=[=($1, $2)], joinType=[inner])
01-04 Project($f15=[$0], $f69=[$0])
01-05
Scan(groupscan=[TestGroupScan@4013af3d{Spec=TestScanSpec@73ccb9dc{table:test,
rsFilter:null}, columns=[`user_id`]}])
01-03 BroadcastExchange
02-01 HashAgg(group=[{0}])
02-02 Values
The problem is we need to know there is a IN condition, either by predicates
push down or other ways, so that we can do the optimization job.
Drill has an option “planner.in_subquery_threshold” can prevent the planner
from transforming the IN condition, but we would like to keep that untouched
because a large “or (equals … )” can also slow down things.
So our question is, if we keep the “in_subquery” planner optimization, is there
any ways we can be notified about the IN condition?
Looking forward to your reply!
Regards
Flow Wei