[GitHub] [spark] EnricoMi commented on pull request #36150: [WIP][SPARK-38864][SQL] Add melt / unpivot to Dataset

GitBox Wed, 13 Apr 2022 03:02:47 -0700


EnricoMi commented on PR #36150:
URL: https://github.com/apache/spark/pull/36150#issuecomment-1097849432


   This can be done through existing expressions. The most common approaches 
available on the Web are:
   
   ```scala
   val df = Seq((1,2,3),(2,3,4)).toDF("id", "var1", "var2")
   
   
   // via expr("stack(...)")
   df.select($"id", expr("stack(2, 'var1', var1, 'var2', var2) as (variable, value)")).show()
   
   
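   // via explode(array(struct))
   import org.apache.spark.sql.types._
   // the casts leave the struct fields unnamed, so they default to col1 and col2
   df.select($"id", explode(array(
       struct(lit("var1"), col("var1").cast(IntegerType)),
       struct(lit("var2"), col("var2").cast(IntegerType)))))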
     .withColumn("variable", $"col.col1")
     .withColumn("value", $"col.col2")
     .drop("col")
     .show()
   
   
   // via flatMap
   import org.apache.spark.sql.types._
   val schema = StructType(Seq(
     StructField("id", IntegerType),
     StructField("variable", StringType),
     StructField("value", IntegerType)))
   
   import org.apache.spark.sql.catalyst.encoders.RowEncoder
   implicit val encoder = RowEncoder(schema)
   
   import org.apache.spark.sql.Row
   df.flatMap(row => Iterator(
     Row(row.getInt(0), "var1", row.getInt(1)),
     Row(row.getInt(0), "var2", row.getInt(2)))).show()
   ```
   
   With wide tables, these approaches become very verbose and the code becomes 
unreadable. They force you to spell out **how** to transform your dataset 
instead of stating **what** you are doing with your DataFrame:
   
   ```scala
   // the proposed melt / unpivot
   df.melt(Seq("id"), Seq("var1", "var2")).show()
   ```
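   
   To illustrate the boilerplate involved today: for wide tables, one 
workaround is to generate the `stack` expression instead of typing it out. 
The sketch below is only an illustration. `StackExpr` and `build` are 
hypothetical names that are part of neither Spark nor this PR, and the 
helper assumes all column names are plain identifiers that need no quoting.
   
   ```scala
   // Hypothetical helper, not part of Spark or this PR: builds the SQL text
   // of a stack(...) generator for an arbitrary list of value columns.
   // Assumes column names are plain identifiers (no spaces, dots or quotes).
   object StackExpr {
     def build(
         values: Seq[String],
         variableName: String = "variable",
         valueName: String = "value"): String = {
       require(values.nonEmpty, "at least one value column is required")
       // each value column contributes a ('name', column) pair
       val pairs = values.map(c => s"'$c', $c")
       s"stack(${values.size}, ${pairs.mkString(", ")}) as ($variableName, $valueName)"
     }
   }
   ```
   
   With the example above, `df.select($"id", expr(StackExpr.build(Seq("var1", 
"var2")))).show()` would run the same expression as the handwritten `stack` 
variant. Every caller still has to carry such a helper around, and it is 
expected to work only when all value columns share one type.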
   
   Further, melt / unpivot is a generic, well-defined operation in other 
systems such as 
[Pandas](https://pandas.pydata.org/docs/reference/api/pandas.melt.html), 
[BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unpivot_operator), 
[T-SQL](https://docs.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-ver15#unpivot-example), 
and [Oracle](https://www.oracletutorial.com/oracle-basics/oracle-unpivot/).
   


