2016-11-16 8:17 GMT+01:00 WeiWan <[email protected]>:

> Hi Julian,
>
> What I mean by “notified about the IN condition" here is My TestGroupScan
> should knows the IN condition when people writes SQLs like “WHERE a IN
> (…)”. It is difficult for me because Drill’s physical planner has transform
> it into a HashJoin.
>
> For example if the values of IN condition lest than 20, the physical plan
> looks like this:
>
> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
> test where user_id in (1,2,3,4,5,6);
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      UnionExchange
> 01-01        Project(user_id=[$0])
> 01-02          SelectionVectorRemover
> 01-03            Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3), =($0,
> 4), =($0, 5), =($0, 6))])
> 01-04              Scan(groupscan=[TestGroupScan@
> 58253efd{Spec=TestScanSpec@893899c{table:test, rsFilter:In($user_id:
> (6)[1,2,3,4,5 ...])}, columns=[`user_id`]}])
>
> And I can use a rule like this to fetch the filter:
>
>   public static StoragePluginOptimizerRule TestRule = new
> StoragePluginOptimizerRule(
>       RelOptHelper.some(FilterPrel.class,
>           RelOptHelper.any(ScanPrel.class)), "TestRule") {
>     @Override
>     public void onMatch(RelOptRuleCall call) {
>       FilterPrel filter = (FilterPrel) call.rel(0);
>       ScanPrel scan = (ScanPrel) call.rel(1);
>       RexNode condition = filter.getCondition();
>
>       // Then we can do the scan optimization with “filter”, which
> including IN condition.
>     }
>   };
>
> But when the values are over 20, the physical plan becomes like this:
>

​There is this drill config parameter (drill>= 1.8) in_subquery_threshold .
Default value is 20. Just set it to a very high value (its a long field)​



> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,
> 13,14,15,16,17,18,19,20);
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      UnionExchange
> 01-01        Project(user_id=[$0])
> 01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
> 01-04            Project($f15=[$0], $f69=[$0])
> 01-05              Scan(groupscan=[TestGroupScan@
> 132c4830{Spec=TestScanSpec@416812cf{table:test, rsFilter:null},
> columns=[`user_id`]}])
> 01-03            BroadcastExchange
> 02-01              HashAgg(group=[{0}])
> 02-02                Values
>
> Now I need to find another way to know and get the IN condition, including
> the field name, and the values of IN clause. How can I do this?
>
>
> Regards
> Flow Wei
>
>
>
> > On Nov 16, 2016, at 13:32, Julian Hyde <[email protected]> wrote:
> >
> > I’m not sure what you mean by “notified about the IN condition”. You
> either have to convert it to a semi-join or not.
> >
> > I don’t know how expensive hash-joins are in Drill. If they’re
> expensive, you could hand-write a UDF that builds a java hash-map, and see
> whether it performs better.
> >
> > Julian
> >
> >
> >> On Nov 15, 2016, at 7:38 PM, WeiWan <[email protected]> wrote:
> >>
> >> Hi!
> >>
> >> I’m working on a new data format using Apache Drill as query engine. It
> has very nice scan speed, high compression ratio, and contains indexes to
> filter out irrelevant parts. Even more interesting is it supports extremely
> fast messages realtime ingestion. It is not yet open sourced but soon.
> >>
> >> But right now we run into a issue and need your help. We know Drill can
> push down predicates to GroupScan. But it may transform the large IN
> condition into an HashJoin to improve performance. Like this:
> >>
> >> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,
> 13,14,15,16,17,18,19,20);
> >> +------+------+
> >> | text | json |
> >> +------+------+
> >> | 00-00    Screen
> >> 00-01      UnionExchange
> >> 01-01        Project(user_id=[$0])
> >> 01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
> >> 01-04            Project($f15=[$0], $f69=[$0])
> >> 01-05              Scan(groupscan=[TestGroupScan@
> 4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, rsFilter:null},
> columns=[`user_id`]}])
> >> 01-03            BroadcastExchange
> >> 02-01              HashAgg(group=[{0}])
> >> 02-02                Values
> >>
> >>
> >> The problem is we need to know there is a IN condition, either by
> predicates push down or other ways, so that we can do the optimization job.
> >>
> >> Drill has an option “planner.in_subquery_threshold” can prevent the
> planner from transforming the IN condition, but we would like to keep that
> untouched because a large  “or (equals … )” can also slow down things.
> >>
> >> So our question is, if we keep the “in_subquery” planner optimization,
> is there any ways we can be notified about the IN condition?
> >>
> >> Looking forward to your reply!
> >>
> >> Regards
> >> Flow Wei
> >>
> >>
> >>
> >
>
>

Reply via email to