[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argum...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21699#discussion_r208479881 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -371,25 +426,14 @@ class RelationalGroupedDataset protected[sql]( /** * (Java-specific) Pivots a column of the current `DataFrame` and performs the specified - * aggregation. - * - * There are two versions of pivot function: one that requires the caller to specify the list - * of distinct values to pivot on, and one that does not. The latter is more concise but less - * efficient, because Spark needs to first compute the list of distinct values internally. + * aggregation. This is an overloaded version of the `pivot` method with `pivotColumn` of + * the `String` type. * - * {{{ - * // Compute the sum of earnings for each year by course with each course as a separate column - * df.groupBy("year").pivot("course", Arrays.asList("dotNET", "Java")).sum("earnings"); - * - * // Or without specifying column values (less efficient) - * df.groupBy("year").pivot("course").sum("earnings"); - * }}} - * - * @param pivotColumn Name of the column to pivot. + * @param pivotColumn the column to pivot. * @param values List of values that will be translated to columns in the output DataFrame. - * @since 1.6.0 + * @since 2.4.0 */ - def pivot(pivotColumn: String, values: java.util.List[Any]): RelationalGroupedDataset = { + def pivot(pivotColumn: Column, values: java.util.List[Any]): RelationalGroupedDataset = { --- End diff -- If there's a plan for auditing it in 3.0.0, I am okay with going ahead with `Column` but thing is, we should deprecate them first. For the current status, I think the problem here is, this is an overloaded version of pivot and wouldn't necessarily make the differences. I used `pivot` heavily in the previous company before and I am pretty sure `pivot(String, actual values)` has been a common pattern and usage so far. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argum...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21699#discussion_r208478196 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -371,25 +426,14 @@ class RelationalGroupedDataset protected[sql]( /** * (Java-specific) Pivots a column of the current `DataFrame` and performs the specified - * aggregation. - * - * There are two versions of pivot function: one that requires the caller to specify the list - * of distinct values to pivot on, and one that does not. The latter is more concise but less - * efficient, because Spark needs to first compute the list of distinct values internally. + * aggregation. This is an overloaded version of the `pivot` method with `pivotColumn` of + * the `String` type. * - * {{{ - * // Compute the sum of earnings for each year by course with each course as a separate column - * df.groupBy("year").pivot("course", Arrays.asList("dotNET", "Java")).sum("earnings"); - * - * // Or without specifying column values (less efficient) - * df.groupBy("year").pivot("course").sum("earnings"); - * }}} - * - * @param pivotColumn Name of the column to pivot. + * @param pivotColumn the column to pivot. * @param values List of values that will be translated to columns in the output DataFrame. - * @since 1.6.0 + * @since 2.4.0 */ - def pivot(pivotColumn: String, values: java.util.List[Any]): RelationalGroupedDataset = { + def pivot(pivotColumn: Column, values: java.util.List[Any]): RelationalGroupedDataset = { --- End diff -- I think it's a bad idea to use `Any` in the API. For the existing ones we can't remove, but why should not add new ones using `Any`. In Spark 3.0 we should audit all the APIs in `functions.scala` and make them consistent(e.g. only use `Column` in the API) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argum...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21699 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argum...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21699#discussion_r207088444 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -339,29 +400,30 @@ class RelationalGroupedDataset protected[sql]( /** * Pivots a column of the current `DataFrame` and performs the specified aggregation. - * There are two versions of pivot function: one that requires the caller to specify the list - * of distinct values to pivot on, and one that does not. The latter is more concise but less - * efficient, because Spark needs to first compute the list of distinct values internally. + * This is an overloaded version of the `pivot` method with `pivotColumn` of the `String` type. * * {{{ * // Compute the sum of earnings for each year by course with each course as a separate column - * df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings") - * - * // Or without specifying column values (less efficient) - * df.groupBy("year").pivot("course").sum("earnings") + * df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings") * }}} * - * @param pivotColumn Name of the column to pivot. + * @param pivotColumn the column to pivot. * @param values List of values that will be translated to columns in the output DataFrame. - * @since 1.6.0 + * @since 2.4.0 */ - def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = { + def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = { +import org.apache.spark.sql.functions.struct groupType match { case RelationalGroupedDataset.GroupByType => +val pivotValues = values.map { --- End diff -- I hope the last commit is reverted and we go ahead orthogonally if @maryannxue is happy with that too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argum...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21699#discussion_r207088374 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala --- @@ -246,4 +267,77 @@ class DataFramePivotSuite extends QueryTest with SharedSQLContext { checkAnswer(df.select($"a".cast(StringType)), Row(tsWithZone)) } } + + test("SPARK-24722: pivoting nested columns") { --- End diff -- I usually leave a JIRA number for one specific regression test when it's a bug since that's going to explicitly point out which case was broken .. but not a big deal though. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argum...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21699#discussion_r206802439 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -339,29 +400,30 @@ class RelationalGroupedDataset protected[sql]( /** * Pivots a column of the current `DataFrame` and performs the specified aggregation. - * There are two versions of pivot function: one that requires the caller to specify the list - * of distinct values to pivot on, and one that does not. The latter is more concise but less - * efficient, because Spark needs to first compute the list of distinct values internally. + * This is an overloaded version of the `pivot` method with `pivotColumn` of the `String` type. * * {{{ * // Compute the sum of earnings for each year by course with each course as a separate column - * df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings") - * - * // Or without specifying column values (less efficient) - * df.groupBy("year").pivot("course").sum("earnings") + * df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings") * }}} * - * @param pivotColumn Name of the column to pivot. + * @param pivotColumn the column to pivot. * @param values List of values that will be translated to columns in the output DataFrame. - * @since 1.6.0 + * @since 2.4.0 */ - def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = { + def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = { +import org.apache.spark.sql.functions.struct groupType match { case RelationalGroupedDataset.GroupByType => +val pivotValues = values.map { --- End diff -- @HyukjinKwon Should I revert the last commit and propose it as a separate PR? I think it makes sense to discuss in JIRA ticket possible alternatives for API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argum...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21699#discussion_r206732588 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -339,29 +400,30 @@ class RelationalGroupedDataset protected[sql]( /** * Pivots a column of the current `DataFrame` and performs the specified aggregation. - * There are two versions of pivot function: one that requires the caller to specify the list - * of distinct values to pivot on, and one that does not. The latter is more concise but less - * efficient, because Spark needs to first compute the list of distinct values internally. + * This is an overloaded version of the `pivot` method with `pivotColumn` of the `String` type. * * {{{ * // Compute the sum of earnings for each year by course with each course as a separate column - * df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings") - * - * // Or without specifying column values (less efficient) - * df.groupBy("year").pivot("course").sum("earnings") + * df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings") * }}} * - * @param pivotColumn Name of the column to pivot. + * @param pivotColumn the column to pivot. * @param values List of values that will be translated to columns in the output DataFrame. - * @since 1.6.0 + * @since 2.4.0 */ - def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = { + def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = { +import org.apache.spark.sql.functions.struct groupType match { case RelationalGroupedDataset.GroupByType => +val pivotValues = values.map { --- End diff -- This PR just propose an overloaded version, pivot(column) of pivot(string). It's not necessary to fix other things together. Also, it need another review iteration I guess. For instance, does array or map work and nested struct, etc. Let's take out this change for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argum...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21699#discussion_r206732345 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -339,29 +400,30 @@ class RelationalGroupedDataset protected[sql]( /** * Pivots a column of the current `DataFrame` and performs the specified aggregation. - * There are two versions of pivot function: one that requires the caller to specify the list - * of distinct values to pivot on, and one that does not. The latter is more concise but less - * efficient, because Spark needs to first compute the list of distinct values internally. + * This is an overloaded version of the `pivot` method with `pivotColumn` of the `String` type. * * {{{ * // Compute the sum of earnings for each year by course with each course as a separate column - * df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings") - * - * // Or without specifying column values (less efficient) - * df.groupBy("year").pivot("course").sum("earnings") + * df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings") * }}} * - * @param pivotColumn Name of the column to pivot. + * @param pivotColumn the column to pivot. * @param values List of values that will be translated to columns in the output DataFrame. - * @since 1.6.0 + * @since 2.4.0 */ - def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = { + def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = { +import org.apache.spark.sql.functions.struct groupType match { case RelationalGroupedDataset.GroupByType => +val pivotValues = values.map { --- End diff -- Hm? wait @maryannxue I think we shouldn't do this at least here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argum...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21699#discussion_r205669387 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -340,29 +399,23 @@ class RelationalGroupedDataset protected[sql]( /** * Pivots a column of the current `DataFrame` and performs the specified aggregation. - * There are two versions of pivot function: one that requires the caller to specify the list --- End diff -- Shall we note this in `Column` API too, or note that this is an overloaded version of string's? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argum...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21699#discussion_r199742580 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -340,36 +340,52 @@ class RelationalGroupedDataset protected[sql]( /** * Pivots a column of the current `DataFrame` and performs the specified aggregation. - * There are two versions of pivot function: one that requires the caller to specify the list - * of distinct values to pivot on, and one that does not. The latter is more concise but less - * efficient, because Spark needs to first compute the list of distinct values internally. * * {{{ * // Compute the sum of earnings for each year by course with each course as a separate column - * df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings") - * - * // Or without specifying column values (less efficient) - * df.groupBy("year").pivot("course").sum("earnings") + * df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings") * }}} * - * @param pivotColumn Name of the column to pivot. + * @param pivotColumn the column to pivot. * @param values List of values that will be translated to columns in the output DataFrame. - * @since 1.6.0 + * @since 2.4.0 */ - def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = { + def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = { --- End diff -- To make diffs smaller, can you move this under the signature `def pivot(pivotColumn: String, values: Seq[Any])`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argum...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/21699 [SPARK-24722][SQL] pivot() with Column type argument ## What changes were proposed in this pull request? In the PR, I propose column-based API for the `pivot()` function. It allows using of nested columns as the pivot column. Also this makes it consistent with how groupBy() works. ## How was this patch tested? I added new tests to `DataFramePivotSuite` and updated PySpark examples for the `pivot()` function. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 pivot-column Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21699.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21699 commit 889e9223510c821f359ee3ce5bec6ce2f746a027 Author: Maxim Gekk Date: 2018-07-02T17:09:19Z Adding pivot() which takes Column as its argument commit f736ea2bca27fee37281bcadb333e7b6bbcd6124 Author: Maxim Gekk Date: 2018-07-02T17:21:07Z Tests for new function commit 5e6822650f4781c343d477589bf252c37b8453c4 Author: Maxim Gekk Date: 2018-07-02T18:24:18Z the since tag is updated commit c82c3979065aba48536a743ebf3384f3c95b570c Author: Maxim Gekk Date: 2018-07-02T19:05:48Z Test for nested columns commit 7d0d2261cef4c66226cd59635603391faabf0046 Author: Maxim Gekk Date: 2018-07-02T20:38:02Z Python test for nested columns commit 0fdd11ff26b4f4ca3b79bdd116aaf1c558643698 Author: Maxim Gekk Date: 2018-07-02T21:02:58Z Adding ticket number to test's title --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org