GitHub user hvanhovell opened a pull request:
https://github.com/apache/spark/pull/14539
[SPARK-16947][SQL] Improve type coercion for inline tables.
## What changes were proposed in this pull request?
Inline tables were added in to Spark SQL in 2.0, e.g.: `select * from
values (1, 'A'), (2, 'B') as tbl(a, b)`
This is currently implemented using a `LocalRelation` and this relation is
created during parsing. This has a weakness: type coercion is based on the
first row in the relation, and all subsequent values are cast in to this type.
The latter violates the principle of least surprise.
This PR fixes this by creating an `Union` of (constant) `Project`s. As a
result type coercion now follows the rules for `Union`, which is similar to
other systems like PostgreSQL. In order to retain optimal speed, I have
extended the `ConvertToLocalRelation`, which makes sure the table gets
rewritten into a `LocalRelation` during optimization.
The following SQL statement:
```SQL
select * from values (1, exp(1)), (2.00, 'b'), (3.0, 'tt'), (40.0001, 4)
x(a, b)
```
... now yields the following plan:
```
== Parsed Logical Plan ==
'Project [*]
+- 'SubqueryAlias x
+- 'Union
:- 'Project [1 AS a#0, 'exp(1) AS b#1]
: +- OneRowRelation$
:- Project [2.00 AS a#2, b AS b#3]
: +- OneRowRelation$
:- Project [3.0 AS a#4, tt AS b#5]
: +- OneRowRelation$
+- Project [40.0001 AS a#6, 4 AS b#7]
+- OneRowRelation$
== Analyzed Logical Plan ==
a: decimal(14,4), b: string
Project [a#18, b#19]
+- SubqueryAlias x
+- Union
:- Project [cast(a#0 as decimal(14,4)) AS a#18, cast(b#1 as string)
AS b#19]
: +- Project [1 AS a#0, EXP(cast(1 as double)) AS b#1]
: +- OneRowRelation$
:- Project [cast(a#2 as decimal(14,4)) AS a#20, b#3]
: +- Project [2.00 AS a#2, b AS b#3]
: +- OneRowRelation$
:- Project [cast(a#4 as decimal(14,4)) AS a#21, b#5]
: +- Project [3.0 AS a#4, tt AS b#5]
: +- OneRowRelation$
+- Project [cast(a#6 as decimal(14,4)) AS a#22, cast(b#7 as string)
AS b#23]
+- Project [40.0001 AS a#6, 4 AS b#7]
+- OneRowRelation$
== Optimized Logical Plan ==
LocalRelation [a#18, b#19]
== Physical Plan ==
LocalTableScan [a#18, b#19]
```
... and the following result:
```
+-------+-----------------+
| a| b|
+-------+-----------------+
| 1.0000|2.718281828459045|
| 2.0000| b|
| 3.0000| tt|
|40.0001| 4|
+-------+-----------------+
```
## How was this patch tested?
Inline tables are converted into `Union`s during parsing. Type coercion for
unions is already covered by various test cases. I have updated the
`PlanParseSuite` to test the parsers the new output, and I have added tests to
the `ConvertToLocalRelationSuite` to tests if inline table (a-like) structures
are converted into `LocalRelation`s.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hvanhovell/spark SPARK-16947
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14539.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14539
----
commit 3b0a28b9626b2fac65688cf3a2702a96131e4307
Author: Herman van Hovell <[email protected]>
Date: 2016-08-08T11:10:38Z
Improve inline table processing.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]