[GitHub] [spark] EnricoMi commented on a diff in pull request #36150: [SPARK-38864][SQL] Add melt / unpivot to Dataset

GitBox Mon, 13 Jun 2022 13:08:30 -0700


EnricoMi commented on code in PR #36150:
URL: https://github.com/apache/spark/pull/36150#discussion_r893842370



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -1382,6 +1417,12 @@ class Analyzer(override val catalogManager: CatalogManager)
       case g: Generate if containsStar(g.generator.children) =>
         throw QueryCompilationErrors.invalidStarUsageError("explode/json_tuple/UDTF",
           extractStar(g.generator.children))
+      // If the Melt ids contain Stars, expand them.

Review Comment:
   Should this be merged into a single `case`?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -524,6 +525,10 @@ class Analyzer(override val catalogManager: CatalogManager)
         if child.resolved && groupByOpt.isDefined && hasUnresolvedAlias(groupByOpt.get) =>
         Pivot(Some(assignAliases(groupByOpt.get)), pivotColumn, pivotValues, aggregates, child)
 
+      case m: Melt if m.child.resolved &&
+        (hasUnresolvedAlias(m.ids) || hasUnresolvedAlias(m.values)) =>
+        m.copy(ids = assignAliases(m.ids), values = assignAliases(m.values))

Review Comment:
   Any particular reason why the other `case`s do not copy here?



##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -2012,7 +2012,97 @@ class Dataset[T] private[sql](
   @scala.annotation.varargs
   def agg(expr: Column, exprs: Column*): DataFrame = groupBy().agg(expr, exprs : _*)
 
- /**
+  /**
+   * Unpivot a DataFrame from wide format to long format, optionally
+   * leaving identifier columns set.
+   *
+   * This function is useful to massage a DataFrame into a format where some
+   * columns are identifier columns ("ids"), while all other columns ("values")
+   * are "unpivoted" to the rows, leaving just two non-id columns, named as given
+   * by `variableColumnName` and `valueColumnName`.
+   *
+   * {{{
+   *   val df = Seq((1, 11, 12L), (2, 21, 22L)).toDF("id", "int", "long")
+   *   df.show()
+   *   // output:
+   *   // +---+---+----+
+   *   // | id|int|long|
+   *   // +---+---+----+
+   *   // |  1| 11|  12|
+   *   // |  2| 21|  22|
+   *   // +---+---+----+
+   *
+   *   df.melt(Array($"id"), Array($"int", $"long"), "variable", "value").show()
+   *   // output:
+   *   // +---+--------+-----+
+   *   // | id|variable|value|
+   *   // +---+--------+-----+
+   *   // |  1|     int|   11|
+   *   // |  1|    long|   12|
+   *   // |  2|     int|   21|
+   *   // |  2|    long|   22|
+   *   // +---+--------+-----+
+   *   // schema:
+   *   //root
+   *   // |-- id: integer (nullable = false)
+   *   // |-- variable: string (nullable = false)
+   *   // |-- value: long (nullable = true)
+   * }}}
+   *
+   * When no "id" columns are given, the unpivoted DataFrame consists of only the
+   * "variable" and "value" columns.
+   *
+   * All "value" columns must be of compatible data type. If they are not the same data type,
+   * all "value" columns are cast to the nearest common data type. For instance,
+   * types `IntegerType` and `LongType` are compatible and cast to `LongType`,
+   * while `IntegerType` and `StringType` are not compatible and `melt` fails.
+   *
+   * @param ids Id columns
+   * @param values Value columns to melt
+   * @param variableColumnName Name of the variable column
+   * @param valueColumnName Name of the value column
+   *
+   * @group untypedrel
+   * @since 3.4.0
+   */
+  def melt(

Review Comment:
   Should there also be an `unpivot` alias?
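
A minimal sketch of the row-level behaviour the Scaladoc above describes, with an alias that simply forwards to `melt`. It is a plain-Scala model over `Map`-based rows, not Spark code: `MeltSketch`, `Row`, and both signatures are hypothetical, and it performs no type coercion of the value columns.

```scala
// Toy model of melt over in-memory rows; not the Dataset implementation.
object MeltSketch {
  // Hypothetical row representation: column name -> value.
  type Row = Map[String, Any]

  // For every input row, emit one output row per value column,
  // keeping the id columns and adding a (variable, value) pair.
  def melt(
      rows: Seq[Row],
      ids: Seq[String],
      values: Seq[String],
      variableColumnName: String,
      valueColumnName: String): Seq[Row] =
    for {
      row <- rows
      v <- values
    } yield ids.map(id => id -> row(id)).toMap +
      (variableColumnName -> v) +
      (valueColumnName -> row(v))

  // Hypothetical alias: same arguments, forwards to melt.
  def unpivot(
      rows: Seq[Row],
      ids: Seq[String],
      values: Seq[String],
      variableColumnName: String,
      valueColumnName: String): Seq[Row] =
    melt(rows, ids, values, variableColumnName, valueColumnName)
}
```

With the two example rows from the Scaladoc, `MeltSketch.melt(rows, Seq("id"), Seq("int", "long"), "variable", "value")` yields four rows, one per (input row, value column) pair.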



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##########
@@ -736,6 +736,21 @@ abstract class TypeCoercionBase {
     }
   }
 
+  /**
+   * Determines the value type of a [[Melt]].
+   */
+  object MeltCoercion extends Rule[LogicalPlan] {
+    override def apply(plan: LogicalPlan): LogicalPlan =
+      plan resolveOperators {
+        case m: Melt if m.values.nonEmpty && m.values.forall(_.resolved) && m.valueType.isEmpty =>
+          val valueDataType = findWiderTypeWithoutStringPromotion(m.values.map(_.dataType))

Review Comment:
   What if users prefer the wider `findWiderCommonType` method? I personally
don't, and supporting it would require wiring an extra argument into `melt`.
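
To make the trade-off concrete, here is a toy model of the two behaviours, assuming the difference that matters is whether a string column may unify with a numeric one. It is not Spark's `TypeCoercion` code: `WideningSketch`, `Ty`, and the `promoteToString` flag are hypothetical stand-ins for the real data types and for the extra argument mentioned above.

```scala
// Toy model of type widening; not Spark's TypeCoercion implementation.
object WideningSketch {
  sealed trait Ty
  case object IntT extends Ty
  case object LongT extends Ty
  case object StringT extends Ty

  // Widen two types. Strings unify with other types only if promoteToString is set.
  def widerPair(a: Ty, b: Ty, promoteToString: Boolean): Option[Ty] = (a, b) match {
    case (x, y) if x == y => Some(x)
    case (IntT, LongT) | (LongT, IntT) => Some(LongT)
    case (StringT, _) | (_, StringT) if promoteToString => Some(StringT)
    case _ => None
  }

  // Fold the pairwise rule over all value column types; None means incompatible.
  def widerType(types: Seq[Ty], promoteToString: Boolean): Option[Ty] =
    types match {
      case Seq() => None
      case head +: tail =>
        tail.foldLeft(Option[Ty](head)) { (acc, t) =>
          acc.flatMap(widerPair(_, t, promoteToString))
        }
    }
}
```

In this model, integer and long columns widen to long either way, while an integer column next to a string column is rejected unless `promoteToString` is set, in which case both become strings.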


