Hi Nicolas,
I know the option “planner.in_subquery_threshold”. But as a Drill plugin, we
don’t want to force people to set this option. Besides it could hurt the
performance if we set it to a very large number.
I’m considering using a rule like this to get the in-lists:
public static StoragePluginOptimizerRule JoinFromIn = new
StoragePluginOptimizerRule(
RelOptHelper.some(HashJoinPrel.class,
RelOptHelper.some(ProjectPrel.class,
RelOptHelper.any(ScanPrel.class)),
RelOptHelper.some(BroadcastExchangePrel.class,
RelOptHelper.some(HashAggPrel.class,
RelOptHelper.any(ValuesPrel.class))))
, "JoinFromIn") {
// ....
}
from the physical plan :
| 00-00 Screen
00-01 UnionExchange
01-01 Project(user_id=[$0])
01-02 HashJoin(condition=[=($1, $2)], joinType=[inner])
01-04 Project($f15=[$0], $f69=[$0])
01-05
Scan(groupscan=[TestGroupScan@4013af3d{Spec=TestScanSpec@73ccb9dc{table:test,
rsFilter:null}, columns=[`user_id`]}])
01-03 BroadcastExchange
02-01 HashAgg(group=[{0}])
02-02 Values
When I found a HashJoin with a Scan and a HashAgg with Values, I can assume
that it is transform from an in-lists.
But I found that it doesn’t match any RelNode. Any suggestions for my rule?
Regards
Flow Wei
> On Nov 16, 2016, at 15:31, Nicolas Paris <[email protected]> wrote:
>
> 2016-11-16 8:17 GMT+01:00 WeiWan <[email protected]
> <mailto:[email protected]>>:
>
>> Hi Julian,
>>
>> What I mean by “notified about the IN condition" here is My TestGroupScan
>> should knows the IN condition when people writes SQLs like “WHERE a IN
>> (…)”. It is difficult for me because Drill’s physical planner has transform
>> it into a HashJoin.
>>
>> For example if the values of IN condition lest than 20, the physical plan
>> looks like this:
>>
>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
>> test where user_id in (1,2,3,4,5,6);
>> +------+------+
>> | text | json |
>> +------+------+
>> | 00-00 Screen
>> 00-01 UnionExchange
>> 01-01 Project(user_id=[$0])
>> 01-02 SelectionVectorRemover
>> 01-03 Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3), =($0,
>> 4), =($0, 5), =($0, 6))])
>> 01-04 Scan(groupscan=[TestGroupScan@
>> 58253efd{Spec=TestScanSpec@893899c{table:test, rsFilter:In($user_id:
>> (6)[1,2,3,4,5 ...])}, columns=[`user_id`]}])
>>
>> And I can use a rule like this to fetch the filter:
>>
>> public static StoragePluginOptimizerRule TestRule = new
>> StoragePluginOptimizerRule(
>> RelOptHelper.some(FilterPrel.class,
>> RelOptHelper.any(ScanPrel.class)), "TestRule") {
>> @Override
>> public void onMatch(RelOptRuleCall call) {
>> FilterPrel filter = (FilterPrel) call.rel(0);
>> ScanPrel scan = (ScanPrel) call.rel(1);
>> RexNode condition = filter.getCondition();
>>
>> // Then we can do the scan optimization with “filter”, which
>> including IN condition.
>> }
>> };
>>
>> But when the values are over 20, the physical plan becomes like this:
>>
>
> There is this drill config parameter (drill>= 1.8) in_subquery_threshold .
> Default value is 20. Just set it to a very high value (its a long field)
>
>
>
>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
>> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,
>> 13,14,15,16,17,18,19,20);
>> +------+------+
>> | text | json |
>> +------+------+
>> | 00-00 Screen
>> 00-01 UnionExchange
>> 01-01 Project(user_id=[$0])
>> 01-02 HashJoin(condition=[=($1, $2)], joinType=[inner])
>> 01-04 Project($f15=[$0], $f69=[$0])
>> 01-05 Scan(groupscan=[TestGroupScan@
>> 132c4830{Spec=TestScanSpec@416812cf{table:test, rsFilter:null},
>> columns=[`user_id`]}])
>> 01-03 BroadcastExchange
>> 02-01 HashAgg(group=[{0}])
>> 02-02 Values
>>
>> Now I need to find another way to know and get the IN condition, including
>> the field name, and the values of IN clause. How can I do this?
>>
>>
>> Regards
>> Flow Wei
>>
>>
>>
>>> On Nov 16, 2016, at 13:32, Julian Hyde <[email protected]> wrote:
>>>
>>> I’m not sure what you mean by “notified about the IN condition”. You
>> either have to convert it to a semi-join or not.
>>>
>>> I don’t know how expensive hash-joins are in Drill. If they’re
>> expensive, you could hand-write a UDF that builds a java hash-map, and see
>> whether it performs better.
>>>
>>> Julian
>>>
>>>
>>>> On Nov 15, 2016, at 7:38 PM, WeiWan <[email protected]> wrote:
>>>>
>>>> Hi!
>>>>
>>>> I’m working on a new data format using Apache Drill as query engine. It
>> has very nice scan speed, high compression ratio, and contains indexes to
>> filter out irrelevant parts. Even more interesting is it supports extremely
>> fast messages realtime ingestion. It is not yet open sourced but soon.
>>>>
>>>> But right now we run into a issue and need your help. We know Drill can
>> push down predicates to GroupScan. But it may transform the large IN
>> condition into an HashJoin to improve performance. Like this:
>>>>
>>>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
>> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,
>> 13,14,15,16,17,18,19,20);
>>>> +------+------+
>>>> | text | json |
>>>> +------+------+
>>>> | 00-00 Screen
>>>> 00-01 UnionExchange
>>>> 01-01 Project(user_id=[$0])
>>>> 01-02 HashJoin(condition=[=($1, $2)], joinType=[inner])
>>>> 01-04 Project($f15=[$0], $f69=[$0])
>>>> 01-05 Scan(groupscan=[TestGroupScan@
>> 4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, rsFilter:null},
>> columns=[`user_id`]}])
>>>> 01-03 BroadcastExchange
>>>> 02-01 HashAgg(group=[{0}])
>>>> 02-02 Values
>>>>
>>>>
>>>> The problem is we need to know there is a IN condition, either by
>> predicates push down or other ways, so that we can do the optimization job.
>>>>
>>>> Drill has an option “planner.in_subquery_threshold” can prevent the
>> planner from transforming the IN condition, but we would like to keep that
>> untouched because a large “or (equals … )” can also slow down things.
>>>>
>>>> So our question is, if we keep the “in_subquery” planner optimization,
>> is there any ways we can be notified about the IN condition?
>>>>
>>>> Looking forward to your reply!
>>>>
>>>> Regards
>>>> Flow Wei