[ 
https://issues.apache.org/jira/browse/DRILL-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968050#comment-16968050
 ] 

Boaz Ben-Zvi commented on DRILL-6949:
-------------------------------------

Tried adding this code (in onMatch() in HashJoinPrule.java) to use stats and 
make the planning decision:
{code:java}
/*
 * For semi-join: When there are too many key-duplicates on the right side 
(i.e., more than %50, based on stats)
 * use the old style plan (with a hash-aggr below the build side to eliminate 
the duplicates)
 * (Else the hash-join memory would be overloaded with those duplicate entries)
 */
if ( isSemi &&
 settings.isStatisticsEnabled() && // only when stats is enabled
 call.getMetadataQuery().getRowCount(right) >
 2 * call.getMetadataQuery().getDistinctRowCount(right, 
ImmutableBitSet.builder().addAll(join.getRightKeys()).build(), null) ) {
 return;
}{code}
However the following test failed (to restore the Hash-Aggr):

 
{code:java}
analyze table lineitem compute statistics;
SET `planner.statistics.use` = true;
select count(*) from lineitem T1 where T1.l_discount in (select 
distinct(cast(T2.l_discount as double)) from lineitem T2);
{code}
 

As apparently the *cast* "confused" the code that gets the number of distinct 
rows (which returned a much higher number).

BTW, trying this query without a cast fails to plan:

 
{code:java}
apache drill (dfs.tmp)> select count(*) from lineitem T1 where T1.l_discount in 
(select T2.l_discount from lineitem T2);
Error: SYSTEM ERROR: CannotPlanException: There are not enough rules to produce 
a node with desired properties: convention=PHYSICAL, 
DrillDistributionTraitDef=SINGLETON([]), sort=[].
Missing conversion is DrillAggregateRel[convention: LOGICAL -> PHYSICAL, 
DrillDistributionTraitDef: ANY([]) -> SINGLETON([])]
There is 1 empty subset: rel#1710:Subset#19.PHYSICAL.SINGLETON([]).[], the 
relevant part of the original plan is as follows
1702:DrillAggregateRel(group=[{}], EXPR$0=[COUNT()])
  1699:DrillProjectRel(subset=[rel#1700:Subset#18.LOGICAL.ANY([]).[]], $f0=[0])
    1697:DrillSemiJoinRel(subset=[rel#1698:Subset#17.LOGICAL.ANY([]).[]], 
condition=[=($0, $1)], joinType=[semi])
      1556:DrillScanRel(subset=[rel#1696:Subset#16.LOGICAL.ANY([]).[]], 
table=[[dfs, tmp, lineitem]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=file:/tmp/lineitem]], 
selectionRoot=file:/tmp/lineitem, numFiles=1, numRowGroups=3, 
usedMetadataFile=false, columns=[`l_discount`]]])
      1556:DrillScanRel(subset=[rel#1696:Subset#16.LOGICAL.ANY([]).[]], 
table=[[dfs, tmp, lineitem]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=file:/tmp/lineitem]], 
selectionRoot=file:/tmp/lineitem, numFiles=1, numRowGroups=3, 
usedMetadataFile=false, columns=[`l_discount`]]])
{code}
 

 

> Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition 
> the inner data any further" when Semi join is enabled
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-6949
>                 URL: https://issues.apache.org/jira/browse/DRILL-6949
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.15.0
>            Reporter: Abhishek Ravi
>            Assignee: Boaz Ben-Zvi
>            Priority: Major
>             Fix For: Future
>
>         Attachments: 23cc1240-74ff-a0c0-8cd5-938fc136e4e2.sys.drill, 
> 23cc1369-0812-63ce-1861-872636571437.sys.drill
>
>
> Following query fails when with *Error: UNSUPPORTED_OPERATION ERROR: 
> Hash-Join can not partition the inner data any further (probably due to too 
> many join-key duplicates)* on TPC-H SF100 data.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 10000;
> set `planner.enable_broadcast_join` = false;
> select
>  count(*)
> from
>  lineitem l1
> where
>  l1.l_discount IN (
>  select
>  distinct(cast(l2.l_discount as double))
>  from
>  lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> {code}
> The subquery contains *distinct* keyword and hence there should not be 
> duplicate values. 
> I suspect that the failure is caused by semijoin because the query succeeds 
> when semijoin is disabled explicitly.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to