Hi Julian,

What I mean by “notified about the IN condition" here is My TestGroupScan 
should knows the IN condition when people writes SQLs like “WHERE a IN (…)”. It 
is difficult for me because Drill’s physical planner has transform it into a 
HashJoin.

For example if the values of IN condition lest than 20, the physical plan looks 
like this:

0: jdbc:drill:drillbit=localhost> explain plan for select user_id from test 
where user_id in (1,2,3,4,5,6);
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      UnionExchange
01-01        Project(user_id=[$0])
01-02          SelectionVectorRemover
01-03            Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3), =($0, 4), 
=($0, 5), =($0, 6))])
01-04              
Scan(groupscan=[TestGroupScan@58253efd{Spec=TestScanSpec@893899c{table:test, 
rsFilter:In($user_id: (6)[1,2,3,4,5 ...])}, columns=[`user_id`]}])

And I can use a rule like this to fetch the filter:

  public static StoragePluginOptimizerRule TestRule = new 
StoragePluginOptimizerRule(
      RelOptHelper.some(FilterPrel.class,
          RelOptHelper.any(ScanPrel.class)), "TestRule") {
    @Override
    public void onMatch(RelOptRuleCall call) {
      FilterPrel filter = (FilterPrel) call.rel(0);
      ScanPrel scan = (ScanPrel) call.rel(1);
      RexNode condition = filter.getCondition();

      // Then we can do the scan optimization with “filter”, which including IN 
condition.
    }
  };

But when the values are over 20, the physical plan becomes like this:

0: jdbc:drill:drillbit=localhost> explain plan for select user_id from test 
where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      UnionExchange
01-01        Project(user_id=[$0])
01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
01-04            Project($f15=[$0], $f69=[$0])
01-05              
Scan(groupscan=[TestGroupScan@132c4830{Spec=TestScanSpec@416812cf{table:test, 
rsFilter:null}, columns=[`user_id`]}])
01-03            BroadcastExchange
02-01              HashAgg(group=[{0}])
02-02                Values

Now I need to find another way to know and get the IN condition, including the 
field name, and the values of IN clause. How can I do this?


Regards
Flow Wei



> On Nov 16, 2016, at 13:32, Julian Hyde <[email protected]> wrote:
> 
> I’m not sure what you mean by “notified about the IN condition”. You either 
> have to convert it to a semi-join or not. 
> 
> I don’t know how expensive hash-joins are in Drill. If they’re expensive, you 
> could hand-write a UDF that builds a java hash-map, and see whether it 
> performs better.
> 
> Julian
> 
> 
>> On Nov 15, 2016, at 7:38 PM, WeiWan <[email protected]> wrote:
>> 
>> Hi!
>> 
>> I’m working on a new data format using Apache Drill as query engine. It has 
>> very nice scan speed, high compression ratio, and contains indexes to filter 
>> out irrelevant parts. Even more interesting is it supports extremely fast 
>> messages realtime ingestion. It is not yet open sourced but soon.
>> 
>> But right now we run into a issue and need your help. We know Drill can push 
>> down predicates to GroupScan. But it may transform the large IN condition 
>> into an HashJoin to improve performance. Like this:
>> 
>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from test 
>> where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
>> +------+------+
>> | text | json |
>> +------+------+
>> | 00-00    Screen
>> 00-01      UnionExchange
>> 01-01        Project(user_id=[$0])
>> 01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
>> 01-04            Project($f15=[$0], $f69=[$0])
>> 01-05              
>> Scan(groupscan=[TestGroupScan@4013af3d{Spec=TestScanSpec@73ccb9dc{table:test,
>>  rsFilter:null}, columns=[`user_id`]}])
>> 01-03            BroadcastExchange
>> 02-01              HashAgg(group=[{0}])
>> 02-02                Values
>> 
>> 
>> The problem is we need to know there is a IN condition, either by predicates 
>> push down or other ways, so that we can do the optimization job. 
>> 
>> Drill has an option “planner.in_subquery_threshold” can prevent the planner 
>> from transforming the IN condition, but we would like to keep that untouched 
>> because a large  “or (equals … )” can also slow down things.
>> 
>> So our question is, if we keep the “in_subquery” planner optimization, is 
>> there any ways we can be notified about the IN condition?
>> 
>> Looking forward to your reply!
>> 
>> Regards
>> Flow Wei
>> 
>> 
>> 
> 

Reply via email to