Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9005 )
Change subject: IMPALA-6314: Add run time scalar subquery check for uncorrelated subqueries ...................................................................... Patch Set 9: (22 comments) Thanks for the comments! I could resolve most of them, but still have some questions. http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.h File be/src/exec/cardinality-check-node.h: http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.h@25 PS9, Line 25: namespace impala { > Side note: I could hot a DCHECK with this query: This query raises the cardinality error. I also added this query to QueryTest/subquery.test http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.h@29 PS9, Line 29: /// Currently it is only used to check <= 1 cardinality of the child. > I cannot think of a scenario where we would need to check for a cardinality No, I just wanted make it more generic. Done. http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.h@38 PS9, Line 38: virtual Status Open(RuntimeState* state) override; > Thanks for your attempt and details. The following ideas might help: Thanks for the tips. However, I still couldn't succeed. I used the following query: select 1 from (values(1 a)) v where a = (select id from functional.alltypestiny) limit 1 Plan is: PLAN-ROOT SINK 03:HASH JOIN [RIGHT SEMI JOIN] |--00:UNION | constant-operands=1 02:CARDINALITY CHECK 01:SCAN HDFS [functional.alltypestiny] But, PartitionedHashJoinNode calls GetNext() twice on its child. First, it calls in Open() -> BlockingJoinNode::GetFirstProbeRow(state) -> CardinalityCheckNode::GetNext(). Then, in PartitionedHashJoinNode::GetNext() it starts to process the rows in the current rowbatch (fetched by Open()). After that it calls NextProbeRowBatch(), which calls GetNext() on CardinalityCheckNode, and this time it raises the error. So as far as I can tell the followings have to go wrong to cause the bug: - change the behavior of PartitionedHashJoinNode - make CardinalityCheckNode a streaming node despite of the comments that warn you not to do that - batch size happens to be zero Anyway, I added an e2e to hs2/test_fetch.py, although I couldn't make it fail. However, if CardinalityCheckNode is in a subplan, the following query succeeds (if batch_size=1, with default batch size it is flaky): select c_custkey from customer c where c_custkey < (select o_orderkey from c.c_orders where o_orderkey in (4285920, 6000000, 5000000))) limit 1; o_orderkey 4285920 and 6000000 belong to the same customer, while o_orderkey 5000000 belongs to a different customer. Without LIMIT 1 it fails. But, since the subquery refers to an identifier from the outer query I guess it counts as a correlated subquery, while this JIRA is only about uncorrelated subqueries. Anyway, it might be a good test for the uncorrelated subquery JIRA. http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.h@58 PS9, Line 58: // The associated SQL statement > for error reporting Done http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.cc File be/src/exec/cardinality-check-node.cc: http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.cc@37 PS9, Line 37: DCHECK_EQ(limit_, max_cardinality_); > Add comment explaining why we are not calling ExecNode::Prepare() we are calling it in L40. http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/ArithmeticExpr.java File fe/src/main/java/org/apache/impala/analysis/ArithmeticExpr.java: http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/ArithmeticExpr.java@189 PS9, Line 189: subquery.getStatement().setIsRuntimeScalar(true); > I don't think this approach is general enough to handle cases like the foll Extended StmtRewriter with a new static method called rewriteNonScalarSubqueries(). It takes an expr, and tries to rewrite the contained subqueries' type to scalar. Now these queries fail at runtime. They succeed if we add a WHERE clause with 'id = 1'. I added tests with these kind of queries. Do you have any other corner case in mind? http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java File fe/src/main/java/org/apache/impala/analysis/QueryStmt.java: http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java@94 PS9, Line 94: public boolean isRuntimeScalar() { return isRuntimeScalar_; } > please move these getters and setters to L365 to not clutter the members li Done http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java@99 PS9, Line 99: protected String origStmtStr_ = null; > * origSqlString_ Renamed. Added comment about it. http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java@101 PS9, Line 101: public void saveStatementStringIfOriginal() { > The correctness of this is pretty subtle and I think it would be easy to br Regarding to views, I added the following query to QueryTest/subquery.test: SELECT id FROM functional.alltypes WHERE id = (SELECT bigint_col FROM functional.alltypes_view) The test checks that the error message doesn't reveal the definition of alltypes_view. I extended ToSqlTest as well to test getOrigStmtStr() with some queries borrowed from PlannerTest/subquery-rewrite.test. http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/CardinalityCheckNode.java File fe/src/main/java/org/apache/impala/planner/CardinalityCheckNode.java: http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/CardinalityCheckNode.java@27 PS9, Line 27: * Node that checks the cardinality of its child, i.e. it aborts the query if the child > Simplify: Simplified statement. Now it is specialized to cardinality 1. http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java: http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@739 PS9, Line 739: Preconditions.checkState( > Check is confusing because the CardinalityCheckNode is only allowed to have Done http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@746 PS9, Line 746: // set the child explicitly, an ExchangeNode have been inserted > grammar 'have' added 'might' because in some cases we don't insert an ExchangeNode http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1116 PS9, Line 1116: if (inlineViewRef.getViewStmt().isRuntimeScalar()) { > Logically this belongs above L1112 because the SELECT node contains predica I moved it up. Actually the following query still succeeds, but I think it shouldn't: select 1 from functional.alltypessmall where id = 3 and id = (select id from functional.alltypes where id = 3 or id = 3962); The outer 'id = 3' is moved to the subquery, and it makes the subquery return only one row. It is a weird case, because I think we want these kind of optimizations, but they prevent correct cardinality checking. http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java: http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java@1164 PS9, Line 1164: // IMPALA-6314: Scalar subquery check is done at runtime, not at parse-time > remove JIRA prefix since this is a feature Done http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java@1165 PS9, Line 1165: AnalyzesOk("select id from functional.alltypestiny t where int_col = " + > use a few different comparison ops (not just "=") Done http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java@1168 PS9, Line 1168: "1 - (select int_col from functional.alltypessmall where id = 1)"); > * remove "where id = 1" or add other tests that show the query will obvious Done http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java@1170 PS9, Line 1170: "(select int_col from functional.alltypessmall where id = 1) * 1"); > use something other than * 1 because that might be optimized away in the fu Done http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test File testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test: http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test@1328 PS9, Line 1328: # Subquery in binary predicate that does cardinality check at parse-time > * Please move these tests to subquery.test where we specifically test subqu * I moved these tests to PlannerTest/subquery-rewrite.test * Do you mean I should use different queries than the ones in QueryTest/subquery.test? http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test@1432 PS9, Line 1432: | | max cardinality: 1 > How about we get rid of 'max cardinality' and only use the limit there will Done http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test@1476 PS9, Line 1476: 3 * (select id > Add a few tests with something more complex in the subquery that is not jus * Added join and grouping aggregation, union test to subquery-rewrite.test * Added nested subqueries tests to nested-collections.test http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test File testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test: http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test@674 PS9, Line 674: # IMPALA-6314: Add run time scalar subquery check for uncorrelated subqueries > * Add 1-2 planner tests that show the plan for queries like this. You can a * Added planner tests to nested-collections.test * Removed the IMPALA-6314 annotations http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-query/queries/QueryTest/subquery.test File testdata/workloads/functional-query/queries/QueryTest/subquery.test: http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-query/queries/QueryTest/subquery.test@920 PS9, Line 920: Query aborted:Subquery must return maximum 1 row(s): > Error is odd because today the only valid limit for the cardinality check n Done -- To view, visit http://gerrit.cloudera.org:8080/9005 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0f52b93a60eeacedd242a2f17fa6b99c4fc38e06 Gerrit-Change-Number: 9005 Gerrit-PatchSet: 9 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Vuk Ercegovac <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 31 Jan 2018 16:17:57 +0000 Gerrit-HasComments: Yes
