Re: About the IN condition push down.

WeiWan Wed, 16 Nov 2016 01:00:07 -0800

Hi Nicolas,

I know the option “planner.in_subquery_threshold”. But as a Drill plugin, we 
don’t want to force people to set this option. Besides it could hurt the 
performance if we set it to a very large number.


I’m considering using a rule like this to get the in-lists:

  public static StoragePluginOptimizerRule JoinFromIn = new 
StoragePluginOptimizerRule(
      RelOptHelper.some(HashJoinPrel.class,
          RelOptHelper.some(ProjectPrel.class,
              RelOptHelper.any(ScanPrel.class)),
          RelOptHelper.some(BroadcastExchangePrel.class,
              RelOptHelper.some(HashAggPrel.class,
                  RelOptHelper.any(ValuesPrel.class))))
      , "JoinFromIn") {
    
    // ....

  }

from the physical plan :

| 00-00    Screen
00-01      UnionExchange
01-01        Project(user_id=[$0])
01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
01-04            Project($f15=[$0], $f69=[$0])
01-05              
Scan(groupscan=[TestGroupScan@4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, 
rsFilter:null}, columns=[`user_id`]}])
01-03            BroadcastExchange
02-01              HashAgg(group=[{0}])
02-02                Values


When I found a HashJoin with a Scan and a HashAgg with Values, I can assume 
that it is transform from an in-lists.

But I found that it doesn’t match any RelNode.  Any suggestions for my rule? 


Regards
Flow Wei



> On Nov 16, 2016, at 15:31, Nicolas Paris <[email protected]> wrote:
> 
> 2016-11-16 8:17 GMT+01:00 WeiWan <[email protected] 
> <mailto:[email protected]>>:
> 
>> Hi Julian,
>> 
>> What I mean by “notified about the IN condition" here is My TestGroupScan
>> should knows the IN condition when people writes SQLs like “WHERE a IN
>> (…)”. It is difficult for me because Drill’s physical planner has transform
>> it into a HashJoin.
>> 
>> For example if the values of IN condition lest than 20, the physical plan
>> looks like this:
>> 
>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
>> test where user_id in (1,2,3,4,5,6);
>> +------+------+
>> | text | json |
>> +------+------+
>> | 00-00    Screen
>> 00-01      UnionExchange
>> 01-01        Project(user_id=[$0])
>> 01-02          SelectionVectorRemover
>> 01-03            Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3), =($0,
>> 4), =($0, 5), =($0, 6))])
>> 01-04              Scan(groupscan=[TestGroupScan@
>> 58253efd{Spec=TestScanSpec@893899c{table:test, rsFilter:In($user_id:
>> (6)[1,2,3,4,5 ...])}, columns=[`user_id`]}])
>> 
>> And I can use a rule like this to fetch the filter:
>> 
>>  public static StoragePluginOptimizerRule TestRule = new
>> StoragePluginOptimizerRule(
>>      RelOptHelper.some(FilterPrel.class,
>>          RelOptHelper.any(ScanPrel.class)), "TestRule") {
>>    @Override
>>    public void onMatch(RelOptRuleCall call) {
>>      FilterPrel filter = (FilterPrel) call.rel(0);
>>      ScanPrel scan = (ScanPrel) call.rel(1);
>>      RexNode condition = filter.getCondition();
>> 
>>      // Then we can do the scan optimization with “filter”, which
>> including IN condition.
>>    }
>>  };
>> 
>> But when the values are over 20, the physical plan becomes like this:
>> 
> 
> There is this drill config parameter (drill>= 1.8) in_subquery_threshold .
> Default value is 20. Just set it to a very high value (its a long field)
> 
> 
> 
>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
>> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,
>> 13,14,15,16,17,18,19,20);
>> +------+------+
>> | text | json |
>> +------+------+
>> | 00-00    Screen
>> 00-01      UnionExchange
>> 01-01        Project(user_id=[$0])
>> 01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
>> 01-04            Project($f15=[$0], $f69=[$0])
>> 01-05              Scan(groupscan=[TestGroupScan@
>> 132c4830{Spec=TestScanSpec@416812cf{table:test, rsFilter:null},
>> columns=[`user_id`]}])
>> 01-03            BroadcastExchange
>> 02-01              HashAgg(group=[{0}])
>> 02-02                Values
>> 
>> Now I need to find another way to know and get the IN condition, including
>> the field name, and the values of IN clause. How can I do this?
>> 
>> 
>> Regards
>> Flow Wei
>> 
>> 
>> 
>>> On Nov 16, 2016, at 13:32, Julian Hyde <[email protected]> wrote:
>>> 
>>> I’m not sure what you mean by “notified about the IN condition”. You
>> either have to convert it to a semi-join or not.
>>> 
>>> I don’t know how expensive hash-joins are in Drill. If they’re
>> expensive, you could hand-write a UDF that builds a java hash-map, and see
>> whether it performs better.
>>> 
>>> Julian
>>> 
>>> 
>>>> On Nov 15, 2016, at 7:38 PM, WeiWan <[email protected]> wrote:
>>>> 
>>>> Hi!
>>>> 
>>>> I’m working on a new data format using Apache Drill as query engine. It
>> has very nice scan speed, high compression ratio, and contains indexes to
>> filter out irrelevant parts. Even more interesting is it supports extremely
>> fast messages realtime ingestion. It is not yet open sourced but soon.
>>>> 
>>>> But right now we run into a issue and need your help. We know Drill can
>> push down predicates to GroupScan. But it may transform the large IN
>> condition into an HashJoin to improve performance. Like this:
>>>> 
>>>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
>> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,
>> 13,14,15,16,17,18,19,20);
>>>> +------+------+
>>>> | text | json |
>>>> +------+------+
>>>> | 00-00    Screen
>>>> 00-01      UnionExchange
>>>> 01-01        Project(user_id=[$0])
>>>> 01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
>>>> 01-04            Project($f15=[$0], $f69=[$0])
>>>> 01-05              Scan(groupscan=[TestGroupScan@
>> 4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, rsFilter:null},
>> columns=[`user_id`]}])
>>>> 01-03            BroadcastExchange
>>>> 02-01              HashAgg(group=[{0}])
>>>> 02-02                Values
>>>> 
>>>> 
>>>> The problem is we need to know there is a IN condition, either by
>> predicates push down or other ways, so that we can do the optimization job.
>>>> 
>>>> Drill has an option “planner.in_subquery_threshold” can prevent the
>> planner from transforming the IN condition, but we would like to keep that
>> untouched because a large  “or (equals … )” can also slow down things.
>>>> 
>>>> So our question is, if we keep the “in_subquery” planner optimization,
>> is there any ways we can be notified about the IN condition?
>>>> 
>>>> Looking forward to your reply!
>>>> 
>>>> Regards
>>>> Flow Wei

Re: About the IN condition push down.

Reply via email to