Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22030#discussion_r208469262
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -403,20 +415,29 @@ class RelationalGroupedDataset protected[sql](
    *
    * {{{
    *   // Compute the sum of earnings for each year by course with each course as a separate column
-   *   df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
+   *   df.groupBy($"year").pivot($"course", Seq(lit("dotNET"), lit("Java"))).sum($"earnings")
+   * }}}
+   *
+   * For pivoting by multiple columns, use the `struct` function to combine the columns and values:
+   *
+   * {{{
+   *   df
+   *     .groupBy($"year")
+   *     .pivot(struct($"course", $"training"), Seq(struct(lit("java"), lit("Experts"))))
+   *     .agg(sum($"earnings"))
    * }}}
    *
    * @param pivotColumn the column to pivot.
    * @param values List of values that will be translated to columns in the output DataFrame.
    * @since 2.4.0
    */
-  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset = {
--- End diff ---
I think https://github.com/apache/spark/pull/22030#discussion_r208456164 makes perfect sense. We really don't need to make it complicated.

> having an explicit Seq[Column] type is less confusing and kind of tells people by itself that we now support complex types in pivot values.

My question was whether that's your speculation or actual feedback from users, since the original interface has existed for a few years and I haven't seen complaints about this so far, as far as I can tell.

It's okay if we clearly document this with some examples. It wouldn't necessarily make much of a difference between the same overloaded APIs.
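To sketch the point being argued, here is a minimal, self-contained stand-in (not Spark code; `Column`, `lit`, and `PivotSketch` are all hypothetical names) showing how a single `Seq[Any]` signature can already accept both plain literals and `Column` values, so long as the behavior is documented:

```scala
// Hypothetical stand-ins for illustration only; not the real Spark types.
case class Column(expr: String)
def lit(v: Any): Column = Column(s"lit($v)")

object PivotSketch {
  // One signature with Seq[Any]: literals and Columns are both accepted,
  // with Columns recognized by a runtime match (assumed logic, not Spark's).
  def pivot(pivotColumn: Column, values: Seq[Any]): Seq[String] =
    values.map {
      case c: Column => c.expr      // already a column expression
      case v         => lit(v).expr // wrap plain literals
    }
}

// Both call styles go through the same Seq[Any] signature:
val fromLiterals = PivotSketch.pivot(Column("course"), Seq("dotNET", "Java"))
val fromColumns  = PivotSketch.pivot(Column("course"), Seq(lit("dotNET"), lit("Java")))
```

Under this sketch the two calls produce identical results, which is the argument for keeping the `Seq[Any]` overload and clarifying it with documentation rather than adding a `Seq[Column]` variant.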
---