Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22030#discussion_r208431249
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -403,20 +415,29 @@ class RelationalGroupedDataset protected[sql](
*
* {{{
* // Compute the sum of earnings for each year by course with each course as a separate column
- * df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
+ * df.groupBy($"year").pivot($"course", Seq(lit("dotNET"), lit("Java"))).sum($"earnings")
+ * }}}
+ *
+ * For pivoting by multiple columns, use the `struct` function to combine the columns and values:
+ *
+ * {{{
+ * df
+ * .groupBy($"year")
+ * .pivot(struct($"course", $"training"), Seq(struct(lit("java"), lit("Experts"))))
+ * .agg(sum($"earnings"))
* }}}
*
* @param pivotColumn the column to pivot.
* @param values List of values that will be translated to columns in the output DataFrame.
* @since 2.4.0
*/
- def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
+ def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset = {
--- End diff --
Yea, I didn't mean to add another signature. My only worry is that `pivot(String, Seq[Any])` can take actual values as well, whereas `pivot(Column, Seq[Column])` does not allow actual values, right?
I was thinking we should allow both cases for both APIs; otherwise it could be confusing, couldn't it? These differences should really be clarified.
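
To make the asymmetry concrete, here is a minimal sketch of the two call styles being compared. It reuses the earnings/course `df` from the Javadoc example above and assumes an active SparkSession named `spark` for the `$"..."` syntax:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._  // assumed SparkSession `spark`, for the $"..." column syntax

// String-based overload pivot(String, Seq[Any]):
// plain values can be passed directly in the Seq.
df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

// Column-based overload pivot(Column, Seq[Column]) as proposed in this diff:
// the values must themselves be Columns, so literals need to be wrapped with lit();
// passing raw values such as "dotNET" would not compile here.
df.groupBy($"year").pivot($"course", Seq(lit("dotNET"), lit("Java"))).sum($"earnings")
```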
---