I’m not sure what you mean by “notified about the IN condition”. You either have to convert it to a semi-join or not.
I don’t know how expensive hash-joins are in Drill. If they’re expensive, you could hand-write a UDF that builds a java hash-map, and see whether it performs better. Julian > On Nov 15, 2016, at 7:38 PM, WeiWan <[email protected]> wrote: > > Hi! > > I’m working on a new data format using Apache Drill as query engine. It has > very nice scan speed, high compression ratio, and contains indexes to filter > out irrelevant parts. Even more interesting is it supports extremely fast > messages realtime ingestion. It is not yet open sourced but soon. > > But right now we run into a issue and need your help. We know Drill can push > down predicates to GroupScan. But it may transform the large IN condition > into an HashJoin to improve performance. Like this: > > 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from test > where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20); > +------+------+ > | text | json | > +------+------+ > | 00-00 Screen > 00-01 UnionExchange > 01-01 Project(user_id=[$0]) > 01-02 HashJoin(condition=[=($1, $2)], joinType=[inner]) > 01-04 Project($f15=[$0], $f69=[$0]) > 01-05 > Scan(groupscan=[TestGroupScan@4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, > rsFilter:null}, columns=[`user_id`]}]) > 01-03 BroadcastExchange > 02-01 HashAgg(group=[{0}]) > 02-02 Values > > > The problem is we need to know there is a IN condition, either by predicates > push down or other ways, so that we can do the optimization job. > > Drill has an option “planner.in_subquery_threshold” can prevent the planner > from transforming the IN condition, but we would like to keep that untouched > because a large “or (equals … )” can also slow down things. > > So our question is, if we keep the “in_subquery” planner optimization, is > there any ways we can be notified about the IN condition? > > Looking forward to your reply! > > Regards > Flow Wei > > >
