Yerui Sun created SPARK-18622:
---------------------------------
Summary: Missing Reference in Multi Union Clauses Cause by
TypeCoercion
Key: SPARK-18622
URL: https://issues.apache.org/jira/browse/SPARK-18622
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.2, 1.6.3
Reporter: Yerui Sun
{code}
spark-sql> explain extended
> select a
> from
> (
> select 0 a, 0 b
> union all
> select sum(1) a, cast(0 as bigint) b
> union all
> select 0 a, 0 b
> )t;
== Parsed Logical Plan ==
'Project ['a]
+- 'SubqueryAlias t
   +- 'Union
      :- 'Union
      :  :- Project [0 AS a#0, 0 AS b#1]
      :  :  +- OneRowRelation$
      :  +- 'Project ['sum(1) AS a#2, cast(0 as bigint) AS b#3L]
      :     +- OneRowRelation$
      +- Project [0 AS a#4, 0 AS b#5]
         +- OneRowRelation$
== Analyzed Logical Plan ==
a: int
Project [a#0]
+- SubqueryAlias t
   +- Union
      :- !Project [a#0, b#9L]
      :  +- Union
      :     :- Project [cast(a#0 as bigint) AS a#11L, b#9L]
      :     :  +- Project [a#0, cast(b#1 as bigint) AS b#9L]
      :     :     +- Project [0 AS a#0, 0 AS b#1]
      :     :        +- OneRowRelation$
      :     +- Project [a#2L, b#3L]
      :        +- Project [a#2L, b#3L]
      :           +- Aggregate [sum(cast(1 as bigint)) AS a#2L, cast(0 as bigint) AS b#3L]
      :              +- OneRowRelation$
      +- Project [a#4, cast(b#5 as bigint) AS b#10L]
         +- Project [0 AS a#4, 0 AS b#5]
            +- OneRowRelation$
== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: resolved attribute(s) a#0 missing from a#11L,b#9L in operator !Project [a#0, b#9L];;
Project [a#0]
+- SubqueryAlias t
   +- Union
      :- !Project [a#0, b#9L]
      :  +- Union
      :     :- Project [cast(a#0 as bigint) AS a#11L, b#9L]
      :     :  +- Project [a#0, cast(b#1 as bigint) AS b#9L]
      :     :     +- Project [0 AS a#0, 0 AS b#1]
      :     :        +- OneRowRelation$
      :     +- Project [a#2L, b#3L]
      :        +- Project [a#2L, b#3L]
      :           +- Aggregate [sum(cast(1 as bigint)) AS a#2L, cast(0 as bigint) AS b#3L]
      :              +- OneRowRelation$
      +- Project [a#4, cast(b#5 as bigint) AS b#10L]
         +- Project [0 AS a#4, 0 AS b#5]
            +- OneRowRelation$
== Physical Plan ==
org.apache.spark.sql.AnalysisException: resolved attribute(s) a#0 missing from a#11L,b#9L in operator !Project [a#0, b#9L];;
Project [a#0]
+- SubqueryAlias t
   +- Union
      :- !Project [a#0, b#9L]
      :  +- Union
      :     :- Project [cast(a#0 as bigint) AS a#11L, b#9L]
      :     :  +- Project [a#0, cast(b#1 as bigint) AS b#9L]
      :     :     +- Project [0 AS a#0, 0 AS b#1]
      :     :        +- OneRowRelation$
      :     +- Project [a#2L, b#3L]
      :        +- Project [a#2L, b#3L]
      :           +- Aggregate [sum(cast(1 as bigint)) AS a#2L, cast(0 as bigint) AS b#3L]
      :              +- OneRowRelation$
      +- Project [a#4, cast(b#5 as bigint) AS b#10L]
         +- Project [0 AS a#4, 0 AS b#5]
            +- OneRowRelation$
{code}
Key points to reproduce the issue:
* 3 or more union clauses;
* One column is a sum aggregate in one union clause and is of Integer type in
the other union clauses;
* Another column has different data types across the union clauses;
The cause of the issue:
- Step 1: TypeCoercion.WidenSetOperationTypes is applied and adds a project
with a cast, since the union clauses have different data types for one column.
With 3 union clauses, the inner union clause is also wrapped in a project;
- Step 2: TypeCoercion.FunctionArgumentConversion is applied and the return
type of sum(int) is widened to BigInt, so one column in the union clauses
changes its data type;
- Step 3: TypeCoercion.WidenSetOperationTypes is applied again and adds another
cast project inside the inner union, since the data type of sum(int) changed.
At this point the project ON the inner union loses its reference, since the
project IN the inner union is newly added and exposes a new attribute. See the
Analyzed Logical Plan: !Project [a#0, b#9L] still references a#0, but its child
now outputs only a#11L and b#9L;
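
The three steps above can be traced with a small toy model. This is only a sketch in Python, not Spark code: the Attr class and the widen and simulate functions are hypothetical stand-ins, only int and bigint are modelled, and fresh expression ids start at 100 instead of matching the ids in the plan. It assumes, as in the plan, that a cast creates a new alias with a new id and that a union exposes the attributes of its first child.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int
    dtype: str  # only "int" and "bigint" are modelled


def widen(children, ids):
    """Toy stand-in for WidenSetOperationTypes (hypothetical, not Spark's code).

    `children` holds the output attributes of each union child. A column that
    needs a cast gets a new alias and therefore a fresh expression id.
    """
    widest = [
        "bigint" if any(child[i].dtype == "bigint" for child in children) else "int"
        for i in range(len(children[0]))
    ]
    return [
        [a if a.dtype == t else Attr(a.name, next(ids), t) for a, t in zip(child, widest)]
        for child in children
    ]


def simulate():
    ids = count(100)  # fresh ids; the numbering is arbitrary
    first = [Attr("a", 0, "int"), Attr("b", 1, "int")]
    second = [Attr("a", 2, "int"), Attr("b", 3, "bigint")]  # sum(1) not yet widened
    third = [Attr("a", 4, "int"), Attr("b", 5, "int")]

    # Step 1: widen the inner union, then the outer one.
    inner = widen([first, second], ids)
    outer = widen([inner[0], third], ids)
    outer_project = outer[0]  # the project ON the inner union, not revisited later

    # Step 2: the sum column becomes bigint while keeping its id.
    second = [Attr("a", 2, "bigint"), inner[1][1]]

    # Step 3: widen the inner union again; column a of the first clause is re-aliased.
    inner = widen([inner[0], second], ids)
    inner_output = inner[0]  # a union exposes its first child's attributes

    # Attributes the outer project still references but the inner union no longer outputs.
    return [a for a in outer_project if a not in inner_output]


missing = simulate()
print(missing)
```

In this model the only dangling reference is column a with id 0, which corresponds to the a#0 reported in the AnalysisException.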
Possible fixes:
* Set operation type coercion should be applied only after the inner clauses
have stabilized, so applying WidenSetOperationTypes last would fix the issue;
* To avoid multiple levels of projects on a set operation clause, handling the
existing cast project carefully in WidenSetOperationTypes should also work;
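
The first option can be illustrated with a toy model. This is a hedged Python sketch, not a patch for Spark: widen and widen_last are hypothetical names, columns are plain (name, id, type) tuples, and only int and bigint are modelled. It assumes the type of sum(1) has already been settled as bigint before any set operation widening runs, so each union is widened exactly once, from the inside out.

```python
from itertools import count


def widen(children, ids):
    """Toy widening: cast each column to the widest type in its position.

    A cast produces a new alias, modelled here as a fresh id.
    """
    widest = [
        "bigint" if any(child[i][2] == "bigint" for child in children) else "int"
        for i in range(len(children[0]))
    ]
    return [
        [col if col[2] == t else (col[0], next(ids), t) for col, t in zip(child, widest)]
        for child in children
    ]


def widen_last():
    ids = count(100)  # fresh ids; the numbering is arbitrary
    first = [("a", 0, "int"), ("b", 1, "int")]
    # Assumption: function coercion ran first, so sum(1) is already bigint.
    second = [("a", 2, "bigint"), ("b", 3, "bigint")]
    third = [("a", 4, "int"), ("b", 5, "int")]

    inner = widen([first, second], ids)
    inner_output = inner[0]  # a union exposes its first child's attributes
    outer = widen([inner_output, third], ids)
    outer_project = outer[0]  # the project ON the inner union

    # References of the outer project that the inner union does not output.
    return [col for col in outer_project if col not in inner_output]


dangling = widen_last()
print(dangling)
```

With this ordering the inner union's output no longer changes after the outer project is built, so the model finds no dangling reference.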
Any comments are appreciated.