EnricoMi commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r974938699


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: CatalogManager)
     def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning(
       _.containsPattern(UNPIVOT), ruleId) {
 
-      // once children and ids are resolved, we can determine values, if non were given
-      case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && up.values.isEmpty =>
-        up.copy(values = up.child.output.diff(up.ids))
+      // once children are resolved, we can determine values from ids and vice versa
+      // if only either is given
+      case up: Unpivot if up.childrenResolved &&
+        up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty =>
+        up.copy(values =
+          Some(
+            up.child.output.diff(up.ids.get.flatMap(_.references))

Review Comment:
   In contrast to other databases, the Spark implementation allows expressions, not just column names, as unpivot columns:
   
   ```scala
   df.unpivot(
     Array($"id"),
     Array($"struct.val", $"struct.time".cast(DateType).as("date")),
     "column",
     "value"
   )
   ```
   
   ```SQL
   SELECT *
   FROM df
   UNPIVOT (
     value FOR column in (struct.val, CAST(struct.time AS DATE) AS date)
   )
   ```
   
   This complicates identifying the unpivoted (value) columns in order to derive the id columns. [Ignoring references here will add columns used in unpivot expressions to the id columns](https://github.com/apache/spark/pull/37407/commits/de19ba0bb33b6369a339a9578173e628102332ae#diff-f306d1dfcd20c4cdcdb5d631416ee9d92bf6893422b0a9683ca4493d7215e5e8):
   
   ```SQL
   SELECT * FROM courseEarnings
   UNPIVOT (
     earningsYear FOR year IN (`2012` AS `twenty-twelve`, `2013` AS `twenty-thirteen`, `2014` AS `twenty-fourteen`)
   )
   
   -- !query schema
   struct<course:string,2012:int,2013:int,2014:int,year:string,earningsYear:int>
   
   -- !query output
   Java 20000   30000   NULL    twenty-fourteen NULL
   Java 20000   30000   NULL    twenty-thirteen 30000
   Java 20000   30000   NULL    twenty-twelve   20000
   dotNET       15000   48000   22500   twenty-fourteen 22500
   dotNET       15000   48000   22500   twenty-thirteen 48000
   dotNET       15000   48000   22500   twenty-twelve   15000
   ```
   
   This is not a problem for the Scala API, as we always provide id columns there. But in the SQL API, this gets messy.
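   To make the effect concrete, here is a minimal, self-contained sketch (not the actual Analyzer code; the names `ValueExpr`, `childOutput`, etc. are hypothetical) that models the two ways of deriving id columns for the `courseEarnings` example above:

   ```scala
   // Hypothetical model of id-column derivation for UNPIVOT.
   object UnpivotIdDerivation {
     // Output columns of the child plan in the courseEarnings example.
     val childOutput: Seq[String] = Seq("course", "2012", "2013", "2014")

     // A value expression with its alias and the columns it references,
     // e.g. `2012` AS `twenty-twelve` references column "2012".
     case class ValueExpr(alias: String, references: Seq[String])

     val values: Seq[ValueExpr] = Seq(
       ValueExpr("twenty-twelve", Seq("2012")),
       ValueExpr("twenty-thirteen", Seq("2013")),
       ValueExpr("twenty-fourteen", Seq("2014")))

     // Ignoring references: only the aliases are removed, so the referenced
     // columns 2012/2013/2014 wrongly remain id columns.
     def idsIgnoringReferences: Seq[String] =
       childOutput.diff(values.map(_.alias))

     // Using references: all columns consumed by value expressions are
     // removed, leaving only "course" as an id column.
     def idsFromReferences: Seq[String] =
       childOutput.diff(values.flatMap(_.references))
   }
   ```

   With this model, `idsIgnoringReferences` keeps all four columns, matching the surplus `2012`/`2013`/`2014` columns in the query schema above, while `idsFromReferences` yields only `course`.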
   
   Maybe we should disallow expressions for unpivot columns (ids and values). Those expressions can be expressed in the FROM clause instead.
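   For instance, the expression from the first example could be computed in a subquery in the FROM clause, so that UNPIVOT only sees plain column names (a sketch, not tested):

   ```SQL
   SELECT *
   FROM (
     SELECT id, struct.val AS val, CAST(struct.time AS DATE) AS date
     FROM df
   )
   UNPIVOT (
     value FOR column IN (val, date)
   )
   ```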
   
   But then the Scala API would also have to abandon expressions to match the SQL API. Or would you say they don't need to match in this sense?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

