[ https://issues.apache.org/jira/browse/SPARK-26403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724839#comment-16724839 ]

ASF GitHub Bot commented on SPARK-26403:
----------------------------------------

HyukjinKwon opened a new pull request #23349: [SPARK-26403][SQL] Support 
pivoting using array column for `pivot(column)` API
URL: https://github.com/apache/spark/pull/23349
 
 
   ## What changes were proposed in this pull request?
   
   This PR changes `Literal(..: Any)` to accept `collection.mutable.WrappedArray`, so that `pivot(Column)` can accept array columns as well.
   
   We can unwrap the array and use it for type dispatch.
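
   As a minimal sketch of that unwrapping step (plain Scala only, not the actual `Literal.apply` code path in Spark; `unwrapSeq` is a hypothetical helper name):

   ```scala
   // Hypothetical helper (not Spark's actual implementation): normalize
   // Scala's runtime Seq wrappers (e.g. WrappedArray, which is how an
   // array column's values appear when collected) into a plain Array
   // before literal type dispatch.
   def unwrapSeq(value: Any): Any = value match {
     case s: scala.collection.Seq[_] => s.toArray[Any] // WrappedArray, List, ...
     case other                      => other          // scalars pass through
   }

   // A Seq-backed value is unwrapped to an Array; scalars are unchanged.
   val unwrapped = unwrapSeq(Seq("a", "x"))
   ```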
   
   ```scala
   val df = Seq(
     (2, Seq.empty[String]),
     (2, Seq("a", "x")),
     (3, Seq.empty[String]),
     (3, Seq("a", "x"))).toDF("x", "s")
   df.groupBy("x").pivot("s").count().show()
   ```
   
   Before:
   
   ```
   Unsupported literal type class scala.collection.mutable.WrappedArray$ofRef 
WrappedArray()
   java.lang.RuntimeException: Unsupported literal type class 
scala.collection.mutable.WrappedArray$ofRef WrappedArray()
        at 
org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:80)
        at 
org.apache.spark.sql.RelationalGroupedDataset.$anonfun$pivot$2(RelationalGroupedDataset.scala:427)
        at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
        at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:39)
        at scala.collection.TraversableLike.map(TraversableLike.scala:237)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at 
org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:425)
        at 
org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:406)
        at 
org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:317)
        at 
org.apache.spark.sql.DataFramePivotSuite.$anonfun$new$1(DataFramePivotSuite.scala:341)
        at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   ```
   
   After:
   
   ```
   +---+---+------+
   |  x| []|[a, x]|
   +---+---+------+
   |  3|  1|     1|
   |  2|  1|     1|
   +---+---+------+
   ```
   
   ## How was this patch tested?
   
   Manually tested, and unit tests were added.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> DataFrame pivot using array column fails with "Unsupported literal type class"
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-26403
>                 URL: https://issues.apache.org/jira/browse/SPARK-26403
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Huon Wilson
>            Priority: Minor
>
> Doing a pivot (using the {{pivot(pivotColumn: Column)}} overload) on a column 
> containing arrays results in a runtime error:
> {code:none}
> scala> val df = Seq((1, Seq("a", "x"), 2), (1, Seq("b"), 3), (2, Seq("a", 
> "x"), 10), (3, Seq(), 100)).toDF("x", "s", "y")
> df: org.apache.spark.sql.DataFrame = [x: int, s: array<string> ... 1 more 
> field]
> scala> df.show
> +---+------+---+
> |  x|     s|  y|
> +---+------+---+
> |  1|[a, x]|  2|
> |  1|   [b]|  3|
> |  2|[a, x]| 10|
> |  3|    []|100|
> +---+------+---+
> scala> df.groupBy("x").pivot("s").agg(collect_list($"y")).show
> java.lang.RuntimeException: Unsupported literal type class 
> scala.collection.mutable.WrappedArray$ofRef WrappedArray()
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:419)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:397)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:317)
>   ... 49 elided
> {code}
> However, this doesn't seem to be a fundamental limitation with {{pivot}}, as 
> it works fine using the {{pivot(pivotColumn: Column, values: Seq[Any])}} 
> overload, as long as the arrays are mapped to the {{Array}} type:
> {code:none}
> scala> val rawValues = df.select("s").distinct.sort("s").collect
> rawValues: Array[org.apache.spark.sql.Row] = Array([WrappedArray()], 
> [WrappedArray(a, x)], [WrappedArray(b)])
> scala> val values = rawValues.map(_.getSeq[String](0).to[Array])
> values: Array[Array[String]] = Array(Array(), Array(a, x), Array(b))
> scala> df.groupBy("x").pivot("s", values).agg(collect_list($"y")).show
> +---+-----+------+---+
> |  x|   []|[a, x]|[b]|
> +---+-----+------+---+
> |  1|   []|   [2]|[3]|
> |  3|[100]|    []| []|
> |  2|   []|  [10]| []|
> +---+-----+------+---+
> {code}
> It would be nice if {{pivot}} were more resilient to Spark's own
> representation of array columns, so that the first version worked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
