[ 
https://issues.apache.org/jira/browse/HIVE-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis resolved HIVE-27278.
----------------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Fixed in 
[https://github.com/apache/hive/commit/69d824e186ca983b8bb1ff9501b069522391dc44].
 Thanks for the review [~kkasa]!

> Simplify correlated queries with empty inputs
> ---------------------------------------------
>
>                 Key: HIVE-27278
>                 URL: https://issues.apache.org/jira/browse/HIVE-27278
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The correlated query below never produces any results, no matter the content 
> of the tables.
> {code:sql}
> create table t1 (id int, val varchar(10)) stored as orc TBLPROPERTIES 
> ('transactional'='true');
> create table t2 (id int, val varchar(10)) stored as orc TBLPROPERTIES 
> ('transactional'='true');
> EXPLAIN CBO SELECT id FROM t1 WHERE NULL IN (SELECT NULL FROM t2 where t1.id 
> = t2.id);
> {code}
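The query is unsatisfiable because of SQL three-valued logic: `NULL IN (...)` evaluates to UNKNOWN when the subquery returns rows and to FALSE when it returns none, and a WHERE clause only keeps rows whose predicate is exactly TRUE. A minimal illustrative sketch in Python (the `UNKNOWN` sentinel is a stand-in for SQL's NULL truth value, not Hive code):

```python
# SQL three-valued logic for "x IN (subquery)", sketched in Python.
# UNKNOWN stands in for SQL's NULL truth value; Python None stands in for SQL NULL.
UNKNOWN = object()

def sql_in(x, rows):
    """Evaluate `x IN rows` under three-valued logic."""
    saw_null = False
    for r in rows:
        if x is None or r is None:
            saw_null = True          # any comparison with NULL is UNKNOWN
        elif x == r:
            return True
    return UNKNOWN if saw_null else False

def where_keeps_row(pred):
    """A WHERE clause keeps a row only when the predicate is exactly TRUE."""
    return pred is True

# "NULL IN (SELECT NULL FROM t2 ...)": non-empty subquery -> UNKNOWN
assert sql_in(None, [None]) is UNKNOWN
# empty subquery (no t2 row with t2.id = t1.id) -> FALSE
assert sql_in(None, []) is False
# either way the WHERE clause filters every row out, so the result is empty
assert not where_keeps_row(sql_in(None, [None]))
assert not where_keeps_row(sql_in(None, []))
```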
> The CBO is able to derive that part of the query is empty and ends up with 
> the following plan.
> {noformat}
> CBO PLAN:
> HiveProject(id=[$0])
>   LogicalCorrelate(correlation=[$cor0], joinType=[semi], requiredColumns=[{}])
>     HiveTableScan(table=[[default, t1]], table:alias=[t1])
>     HiveValues(tuples=[[]])
> {noformat}
> The presence of the LogicalCorrelate is not only redundant but also 
> problematic, since many parts of the optimizer assume that queries are 
> decorrelated and do not know how to handle the LogicalCorrelate.
> In the presence of materialized views, the same query can lead to the 
> following exception during compilation.
> {code:sql}
> CREATE MATERIALIZED VIEW v1 AS SELECT id FROM t2;
> EXPLAIN CBO SELECT id FROM t1 WHERE NULL IN (SELECT NULL FROM t2 where t1.id 
> = t2.id);
> {code}
> {noformat}
> org.apache.calcite.plan.RelOptPlanner$CannotPlanException: There are not 
> enough rules to produce a node with desired properties: convention=HIVE, 
> sort=[], dist=any. All the inputs have relevant nodes, however the cost is 
> still infinite.
> Root: rel#185:RelSubset#3.HIVE.[].any
> Original rel:
> HiveProject(id=[$0]): rowcount = 4.0, cumulative cost = {20.0 rows, 13.0 cpu, 
> 0.0 io}, id = 178
>   LogicalCorrelate(correlation=[$cor0], joinType=[semi], 
> requiredColumns=[{}]): rowcount = 4.0, cumulative cost = {16.0 rows, 9.0 cpu, 
> 0.0 io}, id = 176
>     HiveTableScan(table=[[default, t1]], table:alias=[t1]): rowcount = 4.0, 
> cumulative cost = {4.0 rows, 5.0 cpu, 0.0 io}, id = 111
>     HiveValues(tuples=[[]]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 
> cpu, 0.0 io}, id = 139
> Sets:
> Set#0, type: RecordType(INTEGER id, VARCHAR(10) val, BIGINT 
> BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, 
> RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN 
> ROW__IS__DELETED)
>       rel#180:RelSubset#0.HIVE.[].any, best=rel#111
>               rel#111:HiveTableScan.HIVE.[].any(table=[default, 
> t1],htColumns=[0, 1, 2, 3, 4, 
> 5],insideView=false,plKey=default.t1;,table:alias=t1,tableScanTrait=null), 
> rowcount=4.0, cumulative cost={4.0 rows, 5.0 cpu, 0.0 io}
> Set#1, type: RecordType(NULL _o__c0)
>       rel#181:RelSubset#1.HIVE.[].any, best=rel#139
>               rel#139:HiveValues.HIVE.[].any(type=RecordType(NULL 
> _o__c0),tuples=[]), rowcount=1.0, cumulative cost={1.0 rows, 1.0 cpu, 0.0 io}
> Set#2, type: RecordType(INTEGER id, VARCHAR(10) val, BIGINT 
> BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, 
> RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN 
> ROW__IS__DELETED)
>       rel#183:RelSubset#2.NONE.[].any, best=null
>               
> rel#182:LogicalCorrelate.NONE.[].any(left=RelSubset#180,right=RelSubset#181,correlation=$cor0,joinType=semi,requiredColumns={}),
>  rowcount=4.0, cumulative cost={inf}
> Set#3, type: RecordType(INTEGER id)
>       rel#185:RelSubset#3.HIVE.[].any, best=null
>               
> rel#184:HiveProject.HIVE.[].any(input=RelSubset#183,inputs=0,synthetic=false),
>  rowcount=4.0, cumulative cost={inf}
> Graphviz:
> digraph G {
>       root [style=filled,label="Root"];
>       subgraph cluster0{
>               label="Set 0 RecordType(INTEGER id, VARCHAR(10) val, BIGINT 
> BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, 
> RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN 
> ROW__IS__DELETED)";
>               rel111 [label="rel#111:HiveTableScan\ntable=[default, 
> t1],htColumns=[0, 1, 2, 3, 4, 
> 5],insideView=false,plKey=default.t1;,table:alias=t1,tableScanTrait=null\nrows=4.0,
>  cost={4.0 rows, 5.0 cpu, 0.0 io}",color=blue,shape=box]
>               subset180 [label="rel#180:RelSubset#0.HIVE.[].any"]
>       }
>       subgraph cluster1{
>               label="Set 1 RecordType(NULL _o__c0)";
>               rel139 [label="rel#139:HiveValues\ntype=RecordType(NULL 
> _o__c0),tuples=[]\nrows=1.0, cost={1.0 rows, 1.0 cpu, 0.0 
> io}",color=blue,shape=box]
>               subset181 [label="rel#181:RelSubset#1.HIVE.[].any"]
>       }
>       subgraph cluster2{
>               label="Set 2 RecordType(INTEGER id, VARCHAR(10) val, BIGINT 
> BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, 
> RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN 
> ROW__IS__DELETED)";
>               rel182 
> [label="rel#182:LogicalCorrelate\nleft=RelSubset#180,right=RelSubset#181,correlation=$cor0,joinType=semi,requiredColumns={}\nrows=4.0,
>  cost={inf}",shape=box]
>               subset183 [label="rel#183:RelSubset#2.NONE.[].any"]
>       }
>       subgraph cluster3{
>               label="Set 3 RecordType(INTEGER id)";
>               rel184 
> [label="rel#184:HiveProject\ninput=RelSubset#183,inputs=0,synthetic=false\nrows=4.0,
>  cost={inf}",shape=box]
>               subset185 [label="rel#185:RelSubset#3.HIVE.[].any"]
>       }
>       root -> subset185;
>       subset180 -> rel111[color=blue];
>       subset181 -> rel139[color=blue];
>       subset183 -> rel182; rel182 -> subset180[label="0"]; rel182 -> 
> subset181[label="1"];
>       subset185 -> rel184; rel184 -> subset183;
> }
>       at 
> org.apache.calcite.plan.volcano.RelSubset$CheapestPlanReplacer.visit(RelSubset.java:742)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.calcite.plan.volcano.RelSubset.buildCheapestPlan(RelSubset.java:365)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:520)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.applyMaterializedViewRewriting(CalcitePlanner.java:2058)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1722)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1591)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.calcite.tools.Frameworks.lambda$withPlanner$0(Frameworks.java:131) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:914)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:180) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:126) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1343)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:570)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12820)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:465)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:326)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:180)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:326)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:224) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:107) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:519) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:471) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:436) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:430) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:121)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:227) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257) 
> ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
>       at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:201) 
> ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
>       at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:127) 
> ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
>       at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:425) 
> ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
>       at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:356) 
> ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:733) 
> ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:703) 
> ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:115)
>  ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157) 
> ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62)
>  ~[test-classes/:?]
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_261]
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_261]
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_261]
>       at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_261]
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:135)
>  ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
> ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>  ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) 
> ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>  ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>  ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:413) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.Suite.runChild(Suite.java:128) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.Suite.runChild(Suite.java:27) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 
> ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.apache.hadoop.hive.cli.control.CliAdapter$1$1.evaluate(CliAdapter.java:95)
>  ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>       at org.junit.rules.RunRules.evaluate(RunRules.java:20) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:413) 
> ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:377)
>  ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:138) 
> ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:465) 
> ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:451) 
> ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
> {noformat}
> The goal of this ticket is to get rid of the redundant correlation, both to 
> avoid compilation failures and to unlock further simplifications and improve 
> plan readability.
> The plan can be simplified further based on the following observations.
> If the right side of the correlate is empty:
> * When the joinType is SEMI/INNER, the whole correlate is empty.
> * When the joinType is LEFT, the correlate can also be dropped, using t1 
> padded with nulls on the right side.
> * When the joinType is ANTI, the result is the entire t1, so the correlate 
> can also be dropped.
> * RIGHT and FULL correlations are invalid and should never appear in the plan.
> If the left side of the correlate is empty, the result is empty and the 
> correlate can be dropped for every legal joinType (INNER/SEMI/ANTI/LEFT).
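The case analysis above can be sketched as a small rewrite function. This is an illustrative Python sketch under the stated assumptions, not the actual Hive/Calcite rule; the tuple-based plan nodes (`"Values"`, `"PadRightWithNulls"`, `"Scan"`) are hypothetical stand-ins for the corresponding RelNodes:

```python
# Hypothetical sketch of simplifying a Correlate whose right input is the
# empty Values relation (tuples=[]); not the actual Hive/Calcite implementation.

def simplify_correlate(join_type, left, right_is_empty):
    """Return the simplified plan for a correlate with an empty right side."""
    if not right_is_empty:
        return ("Correlate", join_type, left)      # nothing to simplify
    if join_type in ("SEMI", "INNER"):
        return ("Values", [])                      # whole correlate is empty
    if join_type == "LEFT":
        return ("PadRightWithNulls", left)         # keep left rows, NULL-pad right
    if join_type == "ANTI":
        return left                                # every left row survives
    # RIGHT and FULL correlations are invalid and should never appear
    raise ValueError(f"invalid correlate type: {join_type}")

t1 = ("Scan", "t1")
assert simplify_correlate("SEMI", t1, True) == ("Values", [])
assert simplify_correlate("LEFT", t1, True) == ("PadRightWithNulls", t1)
assert simplify_correlate("ANTI", t1, True) == t1
```

The symmetric left-side case is even simpler: when the left input is empty, the result is empty for every legal joinType, so the correlate collapses to empty Values unconditionally.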



--
This message was sent by Atlassian Jira
(v8.20.10#820010)