EnricoMi commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r974938699
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: CatalogManager)
     def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning(
       _.containsPattern(UNPIVOT), ruleId) {
-      // once children and ids are resolved, we can determine values, if non were given
-      case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && up.values.isEmpty =>
-        up.copy(values = up.child.output.diff(up.ids))
+      // once children are resolved, we can determine values from ids and vice versa
+      // if only either is given
+      case up: Unpivot if up.childrenResolved &&
+          up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty =>
+        up.copy(values =
+          Some(
+            up.child.output.diff(up.ids.get.flatMap(_.references))
Review Comment:
In contrast to other databases, the Spark implementation allows for
expressions rather than column names:
```scala
df.unpivot(
Array($"id"),
Array($"struct.val", $"struct.time".cast(DateType).as("date")),
"column",
"value"
)
```
```SQL
SELECT *
FROM df
UNPIVOT (
value FOR column in (struct.val, CAST(struct.time AS DATE) AS date)
)
```
This complicates identifying the unpivoted (value) columns, which we need in order to derive the id columns. [Ignoring references here adds columns used in unpivot expressions to the id columns](https://github.com/apache/spark/pull/37407/commits/de19ba0bb33b6369a339a9578173e628102332ae#diff-f306d1dfcd20c4cdcdb5d631416ee9d92bf6893422b0a9683ca4493d7215e5e8):
```SQL
SELECT * FROM courseEarnings
UNPIVOT (
  earningsYear FOR year IN (`2012` as `twenty-twelve`, `2013` as `twenty-thirteen`, `2014` as `twenty-fourteen`)
)
-- !query schema
struct<course:string,2012:int,2013:int,2014:int,year:string,earningsYear:int>
-- !query output
Java 20000 30000 NULL twenty-fourteen NULL
Java 20000 30000 NULL twenty-thirteen 30000
Java 20000 30000 NULL twenty-twelve 20000
dotNET 15000 48000 22500 twenty-fourteen 22500
dotNET 15000 48000 22500 twenty-thirteen 48000
dotNET 15000 48000 22500 twenty-twelve 15000
```
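For what it's worth, the leak can be sketched without Spark. This is a minimal model (not the analyzer code; all names are hypothetical), with plain strings standing in for attributes, showing why diffing `child.output` against the value expressions' *references* works while diffing against the (aliased) expressions themselves lets every column leak into the ids:

```scala
// Minimal sketch of deriving implicit id columns for the courseEarnings example.
object UnpivotIdSketch {
  // child.output of the hypothetical courseEarnings relation
  val output: Seq[String] = Seq("course", "2012", "2013", "2014")

  // The value expressions `2012` AS `twenty-twelve`, ...:
  // their alias names vs. the columns they actually reference
  val aliasNames: Seq[String] = Seq("twenty-twelve", "twenty-thirteen", "twenty-fourteen")
  val references: Seq[String] = Seq("2012", "2013", "2014")

  // Diffing against the expressions' references correctly leaves
  // only `course` as the implicit id column.
  def idsFromReferences: Seq[String] = output.diff(references)

  // Diffing against the aliased expression names matches nothing,
  // so all child columns leak into the ids, yielding the
  // struct<course,2012,2013,2014,...> schema shown above.
  def idsFromAliases: Seq[String] = output.diff(aliasNames)
}
```

Aliased value expressions are exactly the case where only the reference-based diff gives the intended result.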
This is not a problem for the Scala API, as we always provide id columns there. But in the SQL API, this gets messy.
Maybe we should disallow expressions for unpivot columns (both ids and values); those expressions can be expressed in the FROM clause instead.
But then the Scala API would also need to abandon expressions to match the SQL API. Or would you say they don't need to match in this sense?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]