Hi Julian,
What I mean by “notified about the IN condition" here is My TestGroupScan
should knows the IN condition when people writes SQLs like “WHERE a IN (…)”. It
is difficult for me because Drill’s physical planner has transform it into a
HashJoin.
For example if the values of IN condition lest than 20, the physical plan looks
like this:
0: jdbc:drill:drillbit=localhost> explain plan for select user_id from test
where user_id in (1,2,3,4,5,6);
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 UnionExchange
01-01 Project(user_id=[$0])
01-02 SelectionVectorRemover
01-03 Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3), =($0, 4),
=($0, 5), =($0, 6))])
01-04
Scan(groupscan=[TestGroupScan@58253efd{Spec=TestScanSpec@893899c{table:test,
rsFilter:In($user_id: (6)[1,2,3,4,5 ...])}, columns=[`user_id`]}])
And I can use a rule like this to fetch the filter:
public static StoragePluginOptimizerRule TestRule = new
StoragePluginOptimizerRule(
RelOptHelper.some(FilterPrel.class,
RelOptHelper.any(ScanPrel.class)), "TestRule") {
@Override
public void onMatch(RelOptRuleCall call) {
FilterPrel filter = (FilterPrel) call.rel(0);
ScanPrel scan = (ScanPrel) call.rel(1);
RexNode condition = filter.getCondition();
// Then we can do the scan optimization with “filter”, which including IN
condition.
}
};
But when the values are over 20, the physical plan becomes like this:
0: jdbc:drill:drillbit=localhost> explain plan for select user_id from test
where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 UnionExchange
01-01 Project(user_id=[$0])
01-02 HashJoin(condition=[=($1, $2)], joinType=[inner])
01-04 Project($f15=[$0], $f69=[$0])
01-05
Scan(groupscan=[TestGroupScan@132c4830{Spec=TestScanSpec@416812cf{table:test,
rsFilter:null}, columns=[`user_id`]}])
01-03 BroadcastExchange
02-01 HashAgg(group=[{0}])
02-02 Values
Now I need to find another way to know and get the IN condition, including the
field name, and the values of IN clause. How can I do this?
Regards
Flow Wei
> On Nov 16, 2016, at 13:32, Julian Hyde <[email protected]> wrote:
>
> I’m not sure what you mean by “notified about the IN condition”. You either
> have to convert it to a semi-join or not.
>
> I don’t know how expensive hash-joins are in Drill. If they’re expensive, you
> could hand-write a UDF that builds a java hash-map, and see whether it
> performs better.
>
> Julian
>
>
>> On Nov 15, 2016, at 7:38 PM, WeiWan <[email protected]> wrote:
>>
>> Hi!
>>
>> I’m working on a new data format using Apache Drill as query engine. It has
>> very nice scan speed, high compression ratio, and contains indexes to filter
>> out irrelevant parts. Even more interesting is it supports extremely fast
>> messages realtime ingestion. It is not yet open sourced but soon.
>>
>> But right now we run into a issue and need your help. We know Drill can push
>> down predicates to GroupScan. But it may transform the large IN condition
>> into an HashJoin to improve performance. Like this:
>>
>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from test
>> where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
>> +------+------+
>> | text | json |
>> +------+------+
>> | 00-00 Screen
>> 00-01 UnionExchange
>> 01-01 Project(user_id=[$0])
>> 01-02 HashJoin(condition=[=($1, $2)], joinType=[inner])
>> 01-04 Project($f15=[$0], $f69=[$0])
>> 01-05
>> Scan(groupscan=[TestGroupScan@4013af3d{Spec=TestScanSpec@73ccb9dc{table:test,
>> rsFilter:null}, columns=[`user_id`]}])
>> 01-03 BroadcastExchange
>> 02-01 HashAgg(group=[{0}])
>> 02-02 Values
>>
>>
>> The problem is we need to know there is a IN condition, either by predicates
>> push down or other ways, so that we can do the optimization job.
>>
>> Drill has an option “planner.in_subquery_threshold” can prevent the planner
>> from transforming the IN condition, but we would like to keep that untouched
>> because a large “or (equals … )” can also slow down things.
>>
>> So our question is, if we keep the “in_subquery” planner optimization, is
>> there any ways we can be notified about the IN condition?
>>
>> Looking forward to your reply!
>>
>> Regards
>> Flow Wei
>>
>>
>>
>