Jia Fan created SPARK-43838:
-------------------------------
Summary: Subquery on single table with having clause can't be
optimized
Key: SPARK-43838
URL: https://issues.apache.org/jira/browse/SPARK-43838
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.0
Reporter: Jia Fan
Eg:
{code:java}
sql("create view t(c1, c2) as values (0, 1), (0, 2), (1, 2)")
sql("select c1, c2, (select count(*) cnt from t t2 where t1.c1 = t2.c1 " +
"having cnt = 0) from t t1").show() {code}
The error will throw:
{code:java}
[PLAN_VALIDATION_FAILED_RULE_IN_BATCH] Rule
org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery in
batch Operator Optimization before Inferring Filters generated an invalid plan:
The plan becomes unresolved: 'Project [toprettystring(c1#224,
Some(America/Los_Angeles)) AS toprettystring(c1)#238, toprettystring(c2#225,
Some(America/Los_Angeles)) AS toprettystring(c2)#239, toprettystring(cnt#246L,
Some(America/Los_Angeles)) AS toprettystring(scalarsubquery(c1))#240]
+- 'Project [c1#224, c2#225, CASE WHEN isnull(alwaysTrue#245) THEN 0 WHEN NOT
(cnt#222L = 0) THEN null ELSE cnt#222L END AS cnt#246L]
+- 'Join LeftOuter, (c1#224 = c1#224#244)
:- Project [col1#226 AS c1#224, col2#227 AS c2#225]
: +- LocalRelation [col1#226, col2#227]
+- Project [cnt#222L, c1#224#244, cnt#222L, c1#224, true AS
alwaysTrue#245]
+- Project [cnt#222L, c1#224 AS c1#224#244, cnt#222L, c1#224]
+- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
+- Project [col1#228 AS c1#224]
+- LocalRelation [col1#228, col2#229]The previous plan:
Project [toprettystring(c1#224, Some(America/Los_Angeles)) AS
toprettystring(c1)#238, toprettystring(c2#225, Some(America/Los_Angeles)) AS
toprettystring(c2)#239, toprettystring(scalar-subquery#223 [c1#224 && (c1#224 =
c1#224#244)], Some(America/Los_Angeles)) AS
toprettystring(scalarsubquery(c1))#240]
: +- Project [cnt#222L, c1#224 AS c1#224#244]
: +- Filter (cnt#222L = 0)
: +- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
: +- Project [col1#228 AS c1#224]
: +- LocalRelation [col1#228, col2#229]
+- Project [col1#226 AS c1#224, col2#227 AS c2#225]
+- LocalRelation [col1#226, col2#227] {code}
The reason are when execute subquery decorrelation, the fields in the subquery
but not in having clause are wrongly pull up. This problem only occurs when
there contain having clause.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]