Stamatis Zampetakis created HIVE-27278:
------------------------------------------

             Summary: Simplify correlated queries with empty inputs
                 Key: HIVE-27278
                 URL: https://issues.apache.org/jira/browse/HIVE-27278
             Project: Hive
          Issue Type: Bug
          Components: CBO
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis


The correlated query below will never produce any result, no matter the content of 
the tables.
{code:sql}
create table t1 (id int, val varchar(10)) stored as orc TBLPROPERTIES 
('transactional'='true');
create table t2 (id int, val varchar(10)) stored as orc TBLPROPERTIES 
('transactional'='true');

EXPLAIN CBO SELECT id FROM t1 WHERE NULL IN (SELECT NULL FROM t2 where t1.id = 
t2.id);
{code}
The CBO is able to derive that part of the query is empty and ends up with the 
following plan.
{noformat}
CBO PLAN:
HiveProject(id=[$0])
  LogicalCorrelate(correlation=[$cor0], joinType=[semi], requiredColumns=[{}])
    HiveTableScan(table=[[default, t1]], table:alias=[t1])
    HiveValues(tuples=[[]])
{noformat}
The presence of the LogicalCorrelate is not only redundant but also problematic: 
many parts of the optimizer assume that queries have been decorrelated and do not 
know how to handle a LogicalCorrelate.

In the presence of materialized views the same query can lead to the following 
exception during compilation.
{code:sql}
CREATE MATERIALIZED VIEW v1 AS SELECT id FROM t2;
EXPLAIN CBO SELECT id FROM t1 WHERE NULL IN (SELECT NULL FROM t2 where t1.id = 
t2.id);
{code}
{noformat}
org.apache.calcite.plan.RelOptPlanner$CannotPlanException: There are not enough 
rules to produce a node with desired properties: convention=HIVE, sort=[], 
dist=any. All the inputs have relevant nodes, however the cost is still 
infinite.
Root: rel#185:RelSubset#3.HIVE.[].any
Original rel:
HiveProject(id=[$0]): rowcount = 4.0, cumulative cost = {20.0 rows, 13.0 cpu, 
0.0 io}, id = 178
  LogicalCorrelate(correlation=[$cor0], joinType=[semi], requiredColumns=[{}]): 
rowcount = 4.0, cumulative cost = {16.0 rows, 9.0 cpu, 0.0 io}, id = 176
    HiveTableScan(table=[[default, t1]], table:alias=[t1]): rowcount = 4.0, 
cumulative cost = {4.0 rows, 5.0 cpu, 0.0 io}, id = 111
    HiveValues(tuples=[[]]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 
cpu, 0.0 io}, id = 139

Sets:
Set#0, type: RecordType(INTEGER id, VARCHAR(10) val, BIGINT 
BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, 
RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN 
ROW__IS__DELETED)
        rel#180:RelSubset#0.HIVE.[].any, best=rel#111
                rel#111:HiveTableScan.HIVE.[].any(table=[default, 
t1],htColumns=[0, 1, 2, 3, 4, 
5],insideView=false,plKey=default.t1;,table:alias=t1,tableScanTrait=null), 
rowcount=4.0, cumulative cost={4.0 rows, 5.0 cpu, 0.0 io}
Set#1, type: RecordType(NULL _o__c0)
        rel#181:RelSubset#1.HIVE.[].any, best=rel#139
                rel#139:HiveValues.HIVE.[].any(type=RecordType(NULL 
_o__c0),tuples=[]), rowcount=1.0, cumulative cost={1.0 rows, 1.0 cpu, 0.0 io}
Set#2, type: RecordType(INTEGER id, VARCHAR(10) val, BIGINT 
BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, 
RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN 
ROW__IS__DELETED)
        rel#183:RelSubset#2.NONE.[].any, best=null
                
rel#182:LogicalCorrelate.NONE.[].any(left=RelSubset#180,right=RelSubset#181,correlation=$cor0,joinType=semi,requiredColumns={}),
 rowcount=4.0, cumulative cost={inf}
Set#3, type: RecordType(INTEGER id)
        rel#185:RelSubset#3.HIVE.[].any, best=null
                
rel#184:HiveProject.HIVE.[].any(input=RelSubset#183,inputs=0,synthetic=false), 
rowcount=4.0, cumulative cost={inf}

Graphviz:
digraph G {
        root [style=filled,label="Root"];
        subgraph cluster0{
                label="Set 0 RecordType(INTEGER id, VARCHAR(10) val, BIGINT 
BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, 
RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN 
ROW__IS__DELETED)";
                rel111 [label="rel#111:HiveTableScan\ntable=[default, 
t1],htColumns=[0, 1, 2, 3, 4, 
5],insideView=false,plKey=default.t1;,table:alias=t1,tableScanTrait=null\nrows=4.0,
 cost={4.0 rows, 5.0 cpu, 0.0 io}",color=blue,shape=box]
                subset180 [label="rel#180:RelSubset#0.HIVE.[].any"]
        }
        subgraph cluster1{
                label="Set 1 RecordType(NULL _o__c0)";
                rel139 [label="rel#139:HiveValues\ntype=RecordType(NULL 
_o__c0),tuples=[]\nrows=1.0, cost={1.0 rows, 1.0 cpu, 0.0 
io}",color=blue,shape=box]
                subset181 [label="rel#181:RelSubset#1.HIVE.[].any"]
        }
        subgraph cluster2{
                label="Set 2 RecordType(INTEGER id, VARCHAR(10) val, BIGINT 
BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, 
RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN 
ROW__IS__DELETED)";
                rel182 
[label="rel#182:LogicalCorrelate\nleft=RelSubset#180,right=RelSubset#181,correlation=$cor0,joinType=semi,requiredColumns={}\nrows=4.0,
 cost={inf}",shape=box]
                subset183 [label="rel#183:RelSubset#2.NONE.[].any"]
        }
        subgraph cluster3{
                label="Set 3 RecordType(INTEGER id)";
                rel184 
[label="rel#184:HiveProject\ninput=RelSubset#183,inputs=0,synthetic=false\nrows=4.0,
 cost={inf}",shape=box]
                subset185 [label="rel#185:RelSubset#3.HIVE.[].any"]
        }
        root -> subset185;
        subset180 -> rel111[color=blue];
        subset181 -> rel139[color=blue];
        subset183 -> rel182; rel182 -> subset180[label="0"]; rel182 -> 
subset181[label="1"];
        subset185 -> rel184; rel184 -> subset183;
}
        at 
org.apache.calcite.plan.volcano.RelSubset$CheapestPlanReplacer.visit(RelSubset.java:742)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.calcite.plan.volcano.RelSubset.buildCheapestPlan(RelSubset.java:365) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:520)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.applyMaterializedViewRewriting(CalcitePlanner.java:2058)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1722)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1591)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.calcite.tools.Frameworks.lambda$withPlanner$0(Frameworks.java:131) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:914)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:180) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:126) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1343)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:570)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12820)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:465)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:326)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:180)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:326)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:224) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:107) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:519) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:471) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:436) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:430) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:121)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:227) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257) 
~[hive-cli-4.0.0-SNAPSHOT.jar:?]
        at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:201) 
~[hive-cli-4.0.0-SNAPSHOT.jar:?]
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:127) 
~[hive-cli-4.0.0-SNAPSHOT.jar:?]
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:425) 
~[hive-cli-4.0.0-SNAPSHOT.jar:?]
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:356) 
~[hive-cli-4.0.0-SNAPSHOT.jar:?]
        at 
org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:733) 
~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:703) 
~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:115)
 ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157) 
~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62)
 ~[test-classes/:?]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
~[?:1.8.0_261]
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
~[?:1.8.0_261]
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:1.8.0_261]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_261]
        at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
 ~[junit-4.13.2.jar:4.13.2]
        at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 ~[junit-4.13.2.jar:4.13.2]
        at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
 ~[junit-4.13.2.jar:4.13.2]
        at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 ~[junit-4.13.2.jar:4.13.2]
        at 
org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:135) 
~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
~[junit-4.13.2.jar:4.13.2]
        at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
 ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) 
~[junit-4.13.2.jar:4.13.2]
        at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
 ~[junit-4.13.2.jar:4.13.2]
        at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
 ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner.run(ParentRunner.java:413) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.Suite.runChild(Suite.java:128) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.Suite.runChild(Suite.java:27) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 
~[junit-4.13.2.jar:4.13.2]
        at 
org.apache.hadoop.hive.cli.control.CliAdapter$1$1.evaluate(CliAdapter.java:95) 
~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at org.junit.rules.RunRules.evaluate(RunRules.java:20) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner.run(ParentRunner.java:413) 
~[junit-4.13.2.jar:4.13.2]
        at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
 ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
        at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
 ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
        at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
 ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
        at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) 
~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
        at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:377)
 ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
        at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:138) 
~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
        at 
org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:465) 
~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
        at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:451) 
~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
{noformat}
The goal of this ticket is to get rid of the redundant correlation, both to avoid 
compilation failures and to unlock further simplifications and improve plan 
readability.

The plan can be simplified further based on the following observations.

If the right side of the correlate is empty:
* When the joinType is SEMI or INNER, the whole correlate is empty.
* When the joinType is LEFT, the correlate can be dropped and replaced by t1 padded with nulls for the right-side columns.
* When the joinType is ANTI, the result is the entire t1, so the correlate can also be dropped.
* RIGHT and FULL correlations are invalid and should never appear in the plan.

If the left side of the correlate is empty, the result is empty and the correlate 
can be dropped for every legal joinType (INNER/SEMI/ANTI/LEFT).
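The case analysis above can be sketched as a small decision function. This is a standalone illustration of the intended simplification rules, not actual Hive or Calcite code; the class and method names (CorrelateSimplification, simplifyEmptyRight, simplifyEmptyLeft) are made up for this sketch.
{code:java}
/** Sketch of the simplification rules for a Correlate with an empty input. */
public class CorrelateSimplification {

    /** Legal correlate join types; RIGHT and FULL are invalid for a correlate. */
    enum JoinType { INNER, SEMI, ANTI, LEFT }

    enum Result {
        EMPTY,             // the whole correlate produces no rows
        LEFT_INPUT,        // the correlate collapses to its left input (t1)
        LEFT_PADDED_NULLS  // left input padded with nulls for the right-side columns
    }

    /** Right side is empty: SEMI/INNER become empty, LEFT pads with nulls, ANTI keeps t1. */
    static Result simplifyEmptyRight(JoinType type) {
        switch (type) {
            case INNER:
            case SEMI:
                return Result.EMPTY;
            case LEFT:
                return Result.LEFT_PADDED_NULLS;
            case ANTI:
                return Result.LEFT_INPUT;
            default:
                throw new IllegalArgumentException("RIGHT/FULL correlations are invalid");
        }
    }

    /** Left side is empty: the result is empty for every legal join type. */
    static Result simplifyEmptyLeft(JoinType type) {
        return Result.EMPTY;
    }

    public static void main(String[] args) {
        System.out.println(simplifyEmptyRight(JoinType.SEMI));  // EMPTY
        System.out.println(simplifyEmptyRight(JoinType.LEFT));  // LEFT_PADDED_NULLS
        System.out.println(simplifyEmptyRight(JoinType.ANTI));  // LEFT_INPUT
        System.out.println(simplifyEmptyLeft(JoinType.INNER));  // EMPTY
    }
}
{code}
In every branch the LogicalCorrelate node itself disappears from the plan, which is what prevents the CannotPlanException above.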


