EnricoMi commented on PR #36150:
URL: https://github.com/apache/spark/pull/36150#issuecomment-1097849432
This can already be done with existing expressions. The most common
approaches found on the web are:
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq((1, 2, 3), (2, 3, 4)).toDF("id", "var1", "var2")

// via expr("stack(...)")
df.select($"id", expr("stack(2, 'var1', var1, 'var2', var2) as (variable, value)")).show()

// via explode(array(struct))
// struct fields without an explicit name get the defaults col1, col2
df.select($"id", explode(array(
    struct(lit("var1"), col("var1").cast(IntegerType)),
    struct(lit("var2"), col("var2").cast(IntegerType)))))
  .withColumn("variable", $"col.col1")
  .withColumn("value", $"col.col2")
  .drop("col")
  .show()

// via flatMap
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("variable", StringType),
  StructField("value", IntegerType)))
implicit val encoder = RowEncoder(schema)
df.flatMap(row => Iterator(
  Row(row.getInt(0), "var1", row.getInt(1)),
  Row(row.getInt(0), "var2", row.getInt(2)))).show()
```
With wide tables, these approaches become very verbose and the code becomes
unreadable. Instead of saying **what** you are doing with your DataFrame:
```scala
// the proposed melt / unpivot
df.melt(Seq("id"), Seq("var1", "var2")).show()
```
you are saying **how** to transform your dataset.
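To illustrate how the workaround scales with table width, here is a minimal sketch of a helper that generates the `stack(...)` expression for an arbitrary list of value columns. `stackExpr` and its parameter names are hypothetical and not part of Spark or this PR. The sketch assumes the column names are plain identifiers that need no quoting, and that all value columns share one type.

```scala
// Hypothetical helper (not part of Spark): builds the SQL text of the
// stack(...) workaround for any number of value columns.
// Assumes column names are plain identifiers that need no quoting.
def stackExpr(valueCols: Seq[String],
              variableName: String = "variable",
              valueName: String = "value"): String = {
  // Each column contributes a pair: a string literal label and a column reference.
  val pairs = valueCols.map(c => s"'$c', $c").mkString(", ")
  s"stack(${valueCols.size}, $pairs) as ($variableName, $valueName)"
}

// Possible use with the DataFrame from the examples above
// (needs org.apache.spark.sql.functions.expr):
//   df.select($"id", expr(stackExpr(Seq("var1", "var2")))).show()
```

Even with such a helper, every project has to carry its own string-building code, which is what a built-in melt would avoid.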
Further, melt / unpivot are generic, well-defined operations in
other systems such as
[Pandas](https://pandas.pydata.org/docs/reference/api/pandas.melt.html),
[BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unpivot_operator),
[T-SQL](https://docs.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-ver15#unpivot-example),
and [Oracle](https://www.oracletutorial.com/oracle-basics/oracle-unpivot/).
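The shared semantics can be sketched on plain Scala collections, without Spark: every input row produces one output row per value column, carrying the id values along. The `melt` function below is a hypothetical illustration, not the API proposed in this PR, and it ignores null handling, which differs between the systems listed above.

```scala
// Plain-collections sketch of melt / unpivot semantics (no Spark needed).
// `melt` is a hypothetical helper for illustration only.
def melt(rows: Seq[Map[String, Int]],
         ids: Seq[String],
         values: Seq[String]): Seq[(Seq[Int], String, Int)] =
  for {
    row <- rows      // each wide input row ...
    v   <- values    // ... yields one long row per value column
  } yield (ids.map(row), v, row(v))

val wide = Seq(
  Map("id" -> 1, "var1" -> 2, "var2" -> 3),
  Map("id" -> 2, "var1" -> 3, "var2" -> 4)
)
val longRows = melt(wide, Seq("id"), Seq("var1", "var2"))
```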