Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9005 )

Change subject: IMPALA-6314: Add run time scalar subquery check for 
uncorrelated subqueries
......................................................................


Patch Set 9:

(21 comments)

http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.h
File be/src/exec/cardinality-check-node.h:

http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.h@29
PS9, Line 29: /// Currently it is only used to check <= 1 cardinality of the 
child.
I cannot think of a scenario where we would need to check for a cardinality 
other than 1, so to me it doesn't make sense to have code that allows that. Did 
you have something in mind for a non-one cardinality?

If not I suggest we make this node specific to cardinality 1 and simplify 
whatever we can.


http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.h@38
PS9, Line 38:   virtual Status Open(RuntimeState* state) override;
Tim and I had a discussion about this node and Tim brought up a very important 
non-trivial point:

This node *must* be blocking for correctness.

Streaming batches from the child in GetNext() could lead to wrong results, 
explained as follows.

Consider a join with a CardinalityCheckNode on the probe side. This can happen 
in practice due to join inversion. If the CardinalityCheckNode was streaming 
then its first batch might have a single row that is returned to the join. The 
join might produce a batch of results and return that to the client. The client 
may not fetch from the query anymore and think that those results are valid 
although they may not be. If the client had continued to fetch until eos then 
the CardinalityCheckNode might have returned an error.

This node must not return any rows until it has seem all batches from its child.

Please explain this critical point in the class comment.

Ideally, we should have a planner and end-to-end test for this scenario. Please 
reach out to me if you need help crafting a test scenario.


http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.h@58
PS9, Line 58:   // The associated SQL statement
for error reporting


http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.cc
File be/src/exec/cardinality-check-node.cc:

http://gerrit.cloudera.org:8080/#/c/9005/9/be/src/exec/cardinality-check-node.cc@37
PS9, Line 37:   DCHECK_EQ(limit_, max_cardinality_);
Add comment explaining why we are not calling ExecNode::Prepare()


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/ArithmeticExpr.java
File fe/src/main/java/org/apache/impala/analysis/ArithmeticExpr.java:

http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/ArithmeticExpr.java@189
PS9, Line 189:             subquery.getStatement().setIsRuntimeScalar(true);
I don't think this approach is general enough to handle cases like the 
following.

This query works fine today:
select 1 from functional.alltypes where coalesce(null, (select bool_col from 
functional.alltypes limit 1));

Remove the limit and the query won't compile.

Your fix should also work for arbitrary exprs that contain a single subquery.

Let's not litter this code in every Expr. We should handle it in a single place 
like the StmtRewriter.

Another interesting query is this one (which works today):
select 1 from functional.alltypes where (select int_col > 10 from 
functional.alltypes limit 1);


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java
File fe/src/main/java/org/apache/impala/analysis/QueryStmt.java:

http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java@94
PS9, Line 94:   public boolean isRuntimeScalar() { return isRuntimeScalar_; }
please move these getters and setters to L365 to not clutter the members list


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java@99
PS9, Line 99:   protected String origStmtStr_ = null;
* origSqlString_
* Needs a comment, in particular to explain how this is different from 
SelectStmt.sqlString_


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java@101
PS9, Line 101:   public void saveStatementStringIfOriginal() {
The correctness of this is pretty subtle and I think it would be easy to break 
in the future.

In particular, we should be sure not to reveal the SQL inside of views because 
the user may not be authorized to see that.

We should add a focused test in ToSqlTest or some other suitable place to guard 
against future changes that might beak this subtle behavior.


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/CardinalityCheckNode.java
File fe/src/main/java/org/apache/impala/planner/CardinalityCheckNode.java:

http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/CardinalityCheckNode.java@27
PS9, Line 27:  * Node that checks the cardinality of its child, i.e. it aborts 
the query if the child
Simplify:

Node that returns an error if its child has an actual cardinality greater than 
a given threshold.


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java:

http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@739
PS9, Line 739:     Preconditions.checkState(
Check is confusing because the CardinalityCheckNode is only allowed to have one 
child. Remove.


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@746
PS9, Line 746:     // set the child explicitly, an ExchangeNode have been 
inserted
grammar 'have'


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1116
PS9, Line 1116:     if (inlineViewRef.getViewStmt().isRuntimeScalar()) {
Logically this belongs above L1112 because the SELECT node contains predicates 
from enclosing blocks which must be evaluated after the cardinality check. The 
SELECT node may even filter some rows which would incorrectly make the 
cardinality check pass.

I don't think we can construct such a plan today (I tried), so we won't be able 
to test this.


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java:

http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java@1164
PS9, Line 1164:     // IMPALA-6314: Scalar subquery check is done at runtime, 
not at parse-time
remove JIRA prefix since this is a feature


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java@1165
PS9, Line 1165:     AnalyzesOk("select id from functional.alltypestiny t where 
int_col = " +
use a few different comparison ops (not just "=")


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java@1168
PS9, Line 1168:         "1 - (select int_col from functional.alltypessmall 
where id = 1)");
* remove "where id = 1" or add other tests that show the query will obviously 
fail at runtime
* add a test with limit 10 in the subquery so show that having a non-1 limit is 
ok


http://gerrit.cloudera.org:8080/#/c/9005/9/fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java@1170
PS9, Line 1170:         "(select int_col from functional.alltypessmall where id 
= 1) * 1");
use something other than * 1 because that might be optimized away in the future 
rendering this test useless


http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test
File testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test:

http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test@1328
PS9, Line 1328: # Subquery in binary predicate that does cardinality check at 
parse-time
* Please move these tests to subquery.test where we specifically test 
subqueries that are rewritten (as opposed to just testing the inline-view 
functionality)
* Are the parse-time cases not already covered in subquery.test? Let's avoid 
adding redundant tests where we can.


http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test@1432
PS9, Line 1432: |  |  max cardinality: 1
How about we get rid of 'max cardinality' and only use the limit there will 
never be a case where their values are different. The different behavior is 
implied by the plan node type.


http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test@1476
PS9, Line 1476:   3 * (select id
Add a few tests with something more complex in the subquery that is not just a 
single scan, for example joins, grouping aggregation, union. Would be good to 
combine several such cases into a single big query, possibly even with nested 
subqueries of this sort.


http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
File 
testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test:

http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test@674
PS9, Line 674: # IMPALA-6314: Add run time scalar subquery check for 
uncorrelated subqueries
* Add 1-2 planner tests that show the plan for queries like this. You can add 
them in nested-collections.test
* We typically don't use the IMPALA-6314 prefix for new features, the 
annotation is more for regression tests of bugs


http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-query/queries/QueryTest/subquery.test
File testdata/workloads/functional-query/queries/QueryTest/subquery.test:

http://gerrit.cloudera.org:8080/#/c/9005/9/testdata/workloads/functional-query/queries/QueryTest/subquery.test@920
PS9, Line 920: Query aborted:Subquery must return maximum 1 row(s):
Error is odd because today the only valid limit for the cardinality check node 
is 1 and I don't anticipate future use case where that will ever change (do you 
have something in mind?). How about being maximally specific:

Subquery must not return more than one row.



--
To view, visit http://gerrit.cloudera.org:8080/9005
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0f52b93a60eeacedd242a2f17fa6b99c4fc38e06
Gerrit-Change-Number: 9005
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Vuk Ercegovac <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 25 Jan 2018 01:21:22 +0000
Gerrit-HasComments: Yes

Reply via email to