Enrico Minack created SPARK-39292:
-------------------------------------
Summary: Make Dataset.melt work with struct fields
Key: SPARK-39292
URL: https://issues.apache.org/jira/browse/SPARK-39292
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.0
Reporter: Enrico Minack
In SPARK-38864, the melt function was added to Dataset.
It would be nice if fields of struct columns could be used as id and value
columns. This would allow for the following:
Given a Dataset with following schema:
{code:java}
root
|-- an: struct (nullable = false)
| |-- id: integer (nullable = false)
|-- str: struct (nullable = false)
| |-- one: string (nullable = true)
| |-- two: string (nullable = true)
{code}
For example:
{code:java}
+---+-------------+
| an| str|
+---+-------------+
|{1}| {one, One}|
|{2}| {two, null}|
|{3}|{null, three}|
|{4}| {null, null}|
+---+-------------+
{code}
Melting with value columns {{Seq("str.one", "str.two")}} on id columns
{{Seq("an.id")}} would result in
{code:java}
+--+--------+-----+
|an|variable|value|
+--+--------+-----+
| 1| str.one| one|
| 1| str.two| One|
| 2| str.one| two|
| 2| str.two| null|
| 3| str.one| null|
| 3| str.two|three|
| 4| str.one| null|
| 4| str.two| null|
+--+--------+-----+
{code}
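Until struct fields are supported directly, a workaround is to project the struct fields to top-level columns first and melt those. A minimal sketch, assuming the {{Dataset.melt}} signature proposed in SPARK-38864 (ids/values as {{Column}} arrays); the flattened column names below are illustrative:
{code:java}
// Workaround sketch (assumes the Dataset.melt signature from SPARK-38864):
// flatten the struct fields into top-level columns, then melt those.
val flat = df.select(
  $"an.id".as("id"),
  $"str.one".as("str_one"),
  $"str.two".as("str_two")
)
val melted = flat.melt(
  ids = Array($"id"),
  values = Array($"str_one", $"str_two"),
  variableColumnName = "variable",
  valueColumnName = "value"
)
// Note: the variable column then contains "str_one"/"str_two" rather than
// the original field paths "str.one"/"str.two", which is what direct
// support for struct fields would improve.
{code}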
See test in {{org.apache.spark.sql.MeltSuite}}:
{code:java}
test("melt with struct fields") {
  val df = meltWideDataDs.select(
    struct($"id").as("an"),
    struct(
      $"str1".as("one"),
      $"str2".as("two")
    ).as("str")
  )
  checkAnswer(
    Melt.of(df, Seq("an.id"), Seq("str.one", "str.two")),
    meltedWideDataRows.map(row => Row(
      row.getInt(0),
      row.getString(1) match {
        case "str1" => "str.one"
        case "str2" => "str.two"
      },
      row.getString(2)
    ))
  )
}
{code}
--
This message was sent by Atlassian Jira
(v8.20.7#820007)