[
https://issues.apache.org/jira/browse/DRILL-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968050#comment-16968050
]
Boaz Ben-Zvi commented on DRILL-6949:
-------------------------------------
Tried adding this code (in onMatch() in HashJoinPrule.java) to use stats and
make the planning decision:
{code:java}
/*
* For semi-join: When there are too many key-duplicates on the right side
(i.e., more than %50, based on stats)
* use the old style plan (with a hash-aggr below the build side to eliminate
the duplicates)
* (Else the hash-join memory would be overloaded with those duplicate entries)
*/
if ( isSemi &&
settings.isStatisticsEnabled() && // only when stats is enabled
call.getMetadataQuery().getRowCount(right) >
2 * call.getMetadataQuery().getDistinctRowCount(right,
ImmutableBitSet.builder().addAll(join.getRightKeys()).build(), null) ) {
return;
}{code}
However the following test failed (to restore the Hash-Aggr):
{code:java}
analyze table lineitem compute statistics;
SET `planner.statistics.use` = true;
select count(*) from lineitem T1 where T1.l_discount in (select
distinct(cast(T2.l_discount as double)) from lineitem T2);
{code}
As apparently the *cast* "confused" the code that gets the number of distinct
rows (which returned a much higher number).
BTW, trying this query without a cast fails to plan:
{code:java}
apache drill (dfs.tmp)> select count(*) from lineitem T1 where T1.l_discount in
(select T2.l_discount from lineitem T2);
Error: SYSTEM ERROR: CannotPlanException: There are not enough rules to produce
a node with desired properties: convention=PHYSICAL,
DrillDistributionTraitDef=SINGLETON([]), sort=[].
Missing conversion is DrillAggregateRel[convention: LOGICAL -> PHYSICAL,
DrillDistributionTraitDef: ANY([]) -> SINGLETON([])]
There is 1 empty subset: rel#1710:Subset#19.PHYSICAL.SINGLETON([]).[], the
relevant part of the original plan is as follows
1702:DrillAggregateRel(group=[{}], EXPR$0=[COUNT()])
1699:DrillProjectRel(subset=[rel#1700:Subset#18.LOGICAL.ANY([]).[]], $f0=[0])
1697:DrillSemiJoinRel(subset=[rel#1698:Subset#17.LOGICAL.ANY([]).[]],
condition=[=($0, $1)], joinType=[semi])
1556:DrillScanRel(subset=[rel#1696:Subset#16.LOGICAL.ANY([]).[]],
table=[[dfs, tmp, lineitem]], groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath [path=file:/tmp/lineitem]],
selectionRoot=file:/tmp/lineitem, numFiles=1, numRowGroups=3,
usedMetadataFile=false, columns=[`l_discount`]]])
1556:DrillScanRel(subset=[rel#1696:Subset#16.LOGICAL.ANY([]).[]],
table=[[dfs, tmp, lineitem]], groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath [path=file:/tmp/lineitem]],
selectionRoot=file:/tmp/lineitem, numFiles=1, numRowGroups=3,
usedMetadataFile=false, columns=[`l_discount`]]])
{code}
> Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition
> the inner data any further" when Semi join is enabled
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: DRILL-6949
> URL: https://issues.apache.org/jira/browse/DRILL-6949
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 1.15.0
> Reporter: Abhishek Ravi
> Assignee: Boaz Ben-Zvi
> Priority: Major
> Fix For: Future
>
> Attachments: 23cc1240-74ff-a0c0-8cd5-938fc136e4e2.sys.drill,
> 23cc1369-0812-63ce-1861-872636571437.sys.drill
>
>
> Following query fails when with *Error: UNSUPPORTED_OPERATION ERROR:
> Hash-Join can not partition the inner data any further (probably due to too
> many join-key duplicates)* on TPC-H SF100 data.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 10000;
> set `planner.enable_broadcast_join` = false;
> select
> count(*)
> from
> lineitem l1
> where
> l1.l_discount IN (
> select
> distinct(cast(l2.l_discount as double))
> from
> lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> {code}
> The subquery contains *distinct* keyword and hence there should not be
> duplicate values.
> I suspect that the failure is caused by semijoin because the query succeeds
> when semijoin is disabled explicitly.
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)