[GitHub] [spark] xinrong-meng commented on a diff in pull request #36150: [SPARK-38864][SQL] Add unpivot / melt to Dataset

GitBox Mon, 18 Jul 2022 09:53:28 -0700


xinrong-meng commented on code in PR #36150:
URL: https://github.com/apache/spark/pull/36150#discussion_r923599373



##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -2013,6 +2013,143 @@ class Dataset[T] private[sql](
   @scala.annotation.varargs
   def agg(expr: Column, exprs: Column*): DataFrame = groupBy().agg(expr, exprs : _*)
 
+  /**
+   * Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
+   * This is the reverse to `groupBy(...).pivot(...).agg(...)`, except for the aggregation,
+   * which cannot be reversed.
+   *
+   * This function is useful to massage a DataFrame into a format where some
+   * columns are identifier columns ("ids"), while all other columns ("values")
+   * are "unpivoted" to the rows, leaving just two non-id columns, named as given
+   * by `variableColumnName` and `valueColumnName`.
+   *
+   * {{{
+   *   val df = Seq((1, 11, 12L), (2, 21, 22L)).toDF("id", "int", "long")
+   *   df.show()
+   *   // output:
+   *   // +---+---+----+
+   *   // | id|int|long|
+   *   // +---+---+----+
+   *   // |  1| 11|  12|
+   *   // |  2| 21|  22|
+   *   // +---+---+----+
+   *
+   *   df.unpivot(Array($"id"), Array($"int", $"long"), "variable", "value").show()
+   *   // output:
+   *   // +---+--------+-----+
+   *   // | id|variable|value|
+   *   // +---+--------+-----+
+   *   // |  1|     int|   11|
+   *   // |  1|    long|   12|
+   *   // |  2|     int|   21|
+   *   // |  2|    long|   22|
+   *   // +---+--------+-----+
+   *   // schema:
+   *   //root
+   *   // |-- id: integer (nullable = false)
+   *   // |-- variable: string (nullable = false)
+   *   // |-- value: long (nullable = true)
+   * }}}
+   *
+   * When no "id" columns are given, the unpivoted DataFrame consists of only the
+   * "variable" and "value" columns.
+   *
+   * All "value" columns must share a least common data type. Unless they are the same data type,
+   * all "value" columns are cast to the nearest common data type. For instance,
+   * types `IntegerType` and `LongType` are cast to `LongType`, while `IntegerType` and `StringType`
+   * do not have a common data type and `unpivot` fails.
+   *
+   * @param ids Id columns
+   * @param values Value columns to unpivot
+   * @param variableColumnName Name of the variable column

Review Comment:
   I think it is fine, since it follows pandas' `variable`/`value` naming and has the
   docstring below to explain:
   
   "leaving just two non-id columns, named as given by `variableColumnName` and `valueColumnName`."
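
   To make the naming concrete, here is a rough sketch of the row-level semantics the docstring describes, written against plain Scala collections rather than Spark. The names `Wide`, `Unpivoted`, `UnpivotSketch` and `unpivotRows` are made up for illustration and are not part of this PR; the `Int` value is widened to `Long` by hand to mirror the `IntegerType`/`LongType` case, which only approximates what the actual implementation does.

```scala
// Hypothetical illustration only: plain collections, no Spark dependency.
// One wide row with an id column and two value columns of different types.
final case class Wide(id: Int, int: Int, long: Long)

// One long-format row: the id, the name of the source column ("variable"),
// and its value widened to the common type ("value").
final case class Unpivoted(id: Int, variable: String, value: Long)

object UnpivotSketch {
  // Each wide row yields one output row per value column.
  // The Int column is widened to Long so both values share one type.
  def unpivotRows(rows: Seq[Wide]): Seq[Unpivoted] =
    rows.flatMap { r =>
      Seq(
        Unpivoted(r.id, "int", r.int.toLong),
        Unpivoted(r.id, "long", r.long)
      )
    }

  def main(args: Array[String]): Unit = {
    // Same input rows as the docstring example.
    val wide = Seq(Wide(1, 11, 12L), Wide(2, 21, 22L))
    unpivotRows(wide).foreach(println)
  }
}
```

   The two non-id fields here play the roles of `variableColumnName` and `valueColumnName`; in the real API they are just strings supplied by the caller.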



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

