That sounds perfect if we can push down all those where conditions into scan/reader. Really like to see it in the next few releases đ.
Regards Flow Wei > On Nov 18, 2016, at 03:15, Chunhui Shi <[email protected]> wrote: > > I feel that we should allow preserving IN operator and not transforming it > to OR or HashJoin, and seems this has to be done in Calcite. So we could > allow pushing IN to scan/reader to utilize progressing data formats and > index designs, e.g. min/max, a variety of bloom filters. > > On Wed, Nov 16, 2016 at 12:59 AM, WeiWan <[email protected] > <mailto:[email protected]>> wrote: > >> Hi Nicolas, >> >> I know the option âplanner.in_subquery_thresholdâ. But as a Drill plugin, >> we donât want to force people to set this option. Besides it could hurt the >> performance if we set it to a very large number. >> >> Iâm considering using a rule like this to get the in-lists: >> >> public static StoragePluginOptimizerRule JoinFromIn = new >> StoragePluginOptimizerRule( >> RelOptHelper.some(HashJoinPrel.class, >> RelOptHelper.some(ProjectPrel.class, >> RelOptHelper.any(ScanPrel.class)), >> RelOptHelper.some(BroadcastExchangePrel.class, >> RelOptHelper.some(HashAggPrel.class, >> RelOptHelper.any(ValuesPrel.class)))) >> , "JoinFromIn") { >> >> // .... >> >> } >> >> from the physical plan : >> >> | 00-00 Screen >> 00-01 UnionExchange >> 01-01 Project(user_id=[$0]) >> 01-02 HashJoin(condition=[=($1, $2)], joinType=[inner]) >> 01-04 Project($f15=[$0], $f69=[$0]) >> 01-05 Scan(groupscan=[TestGroupScan@ >> 4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, rsFilter:null}, >> columns=[`user_id`]}]) >> 01-03 BroadcastExchange >> 02-01 HashAgg(group=[{0}]) >> 02-02 Values >> >> >> When I found a HashJoin with a Scan and a HashAgg with Values, I can >> assume that it is transform from an in-lists. >> >> But I found that it doesnât match any RelNode. Any suggestions for my >> rule? >> >> >> Regards >> Flow Wei >> >> >> >>> On Nov 16, 2016, at 15:31, Nicolas Paris <[email protected]> wrote: >>> >>> 2016-11-16 8:17 GMT+01:00 WeiWan <[email protected] >>> <mailto:[email protected]> <mailto: >> [email protected]>>: >>> >>>> Hi Julian, >>>> >>>> What I mean by ânotified about the IN condition" here is My >> TestGroupScan >>>> should knows the IN condition when people writes SQLs like âWHERE a IN >>>> (âŚ)â. It is difficult for me because Drillâs physical planner has >> transform >>>> it into a HashJoin. >>>> >>>> For example if the values of IN condition lest than 20, the physical >> plan >>>> looks like this: >>>> >>>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from >>>> test where user_id in (1,2,3,4,5,6); >>>> +------+------+ >>>> | text | json | >>>> +------+------+ >>>> | 00-00 Screen >>>> 00-01 UnionExchange >>>> 01-01 Project(user_id=[$0]) >>>> 01-02 SelectionVectorRemover >>>> 01-03 Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3), >> =($0, >>>> 4), =($0, 5), =($0, 6))]) >>>> 01-04 Scan(groupscan=[TestGroupScan@ >>>> 58253efd{Spec=TestScanSpec@893899c{table:test, rsFilter:In($user_id: >>>> (6)[1,2,3,4,5 ...])}, columns=[`user_id`]}]) >>>> >>>> And I can use a rule like this to fetch the filter: >>>> >>>> public static StoragePluginOptimizerRule TestRule = new >>>> StoragePluginOptimizerRule( >>>> RelOptHelper.some(FilterPrel.class, >>>> RelOptHelper.any(ScanPrel.class)), "TestRule") { >>>> @Override >>>> public void onMatch(RelOptRuleCall call) { >>>> FilterPrel filter = (FilterPrel) call.rel(0); >>>> ScanPrel scan = (ScanPrel) call.rel(1); >>>> RexNode condition = filter.getCondition(); >>>> >>>> // Then we can do the scan optimization with âfilterâ, which >>>> including IN condition. >>>> } >>>> }; >>>> >>>> But when the values are over 20, the physical plan becomes like this: >>>> >>> >>> âThere is this drill config parameter (drill>= 1.8) >> in_subquery_threshold . >>> Default value is 20. Just set it to a very high value (its a long field)â >>> >>> >>> >>>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from >>>> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12, >>>> 13,14,15,16,17,18,19,20); >>>> +------+------+ >>>> | text | json | >>>> +------+------+ >>>> | 00-00 Screen >>>> 00-01 UnionExchange >>>> 01-01 Project(user_id=[$0]) >>>> 01-02 HashJoin(condition=[=($1, $2)], joinType=[inner]) >>>> 01-04 Project($f15=[$0], $f69=[$0]) >>>> 01-05 Scan(groupscan=[TestGroupScan@ >>>> 132c4830{Spec=TestScanSpec@416812cf{table:test, rsFilter:null}, >>>> columns=[`user_id`]}]) >>>> 01-03 BroadcastExchange >>>> 02-01 HashAgg(group=[{0}]) >>>> 02-02 Values >>>> >>>> Now I need to find another way to know and get the IN condition, >> including >>>> the field name, and the values of IN clause. How can I do this? >>>> >>>> >>>> Regards >>>> Flow Wei >>>> >>>> >>>> >>>>> On Nov 16, 2016, at 13:32, Julian Hyde <[email protected]> wrote: >>>>> >>>>> Iâm not sure what you mean by ânotified about the IN conditionâ. You >>>> either have to convert it to a semi-join or not. >>>>> >>>>> I donât know how expensive hash-joins are in Drill. If theyâre >>>> expensive, you could hand-write a UDF that builds a java hash-map, and >> see >>>> whether it performs better. >>>>> >>>>> Julian >>>>> >>>>> >>>>>> On Nov 15, 2016, at 7:38 PM, WeiWan <[email protected]> wrote: >>>>>> >>>>>> Hi! >>>>>> >>>>>> Iâm working on a new data format using Apache Drill as query engine. >> It >>>> has very nice scan speed, high compression ratio, and contains indexes >> to >>>> filter out irrelevant parts. Even more interesting is it supports >> extremely >>>> fast messages realtime ingestion. It is not yet open sourced but soon. >>>>>> >>>>>> But right now we run into a issue and need your help. We know Drill >> can >>>> push down predicates to GroupScan. But it may transform the large IN >>>> condition into an HashJoin to improve performance. Like this: >>>>>> >>>>>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from >>>> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12, >>>> 13,14,15,16,17,18,19,20); >>>>>> +------+------+ >>>>>> | text | json | >>>>>> +------+------+ >>>>>> | 00-00 Screen >>>>>> 00-01 UnionExchange >>>>>> 01-01 Project(user_id=[$0]) >>>>>> 01-02 HashJoin(condition=[=($1, $2)], joinType=[inner]) >>>>>> 01-04 Project($f15=[$0], $f69=[$0]) >>>>>> 01-05 Scan(groupscan=[TestGroupScan@ >>>> 4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, rsFilter:null}, >>>> columns=[`user_id`]}]) >>>>>> 01-03 BroadcastExchange >>>>>> 02-01 HashAgg(group=[{0}]) >>>>>> 02-02 Values >>>>>> >>>>>> >>>>>> The problem is we need to know there is a IN condition, either by >>>> predicates push down or other ways, so that we can do the optimization >> job. >>>>>> >>>>>> Drill has an option âplanner.in_subquery_thresholdâ can prevent the >>>> planner from transforming the IN condition, but we would like to keep >> that >>>> untouched because a large âor (equals ⌠)â can also slow down things. >>>>>> >>>>>> So our question is, if we keep the âin_subqueryâ planner optimization, >>>> is there any ways we can be notified about the IN condition? >>>>>> >>>>>> Looking forward to your reply! >>>>>> >>>>>> Regards >>>>>> Flow Wei
