Hi,

I am having issues trying to rename or move subcolumns when they are inside
a repeated structure.

Given a certain schema, I can create a different layout to provide an
alternative view. For example, I can move one column and put it inside a
subcolumn, and add an extra literal field, just for fun:

import org.apache.spark.sql.{DataFrame, Column}
import org.apache.spark.sql.functions
import sqlContext.implicits._

case class Level0ArrayStruct(
  level_0_array_a: String,
  level_0_array_b: Int)

case class Level1ArrayStruct(
  level_1_array_a: String,
  level_1_array_b: Int)

case class Level1Struct(
  level_1_a: String,
  level_1_b: Int)

case class Level0Struct(
  level_0_a: String,
  level_0_b: Int,
  level_0_array: Seq[Level0ArrayStruct],
  level_0_struct: Level1Struct)

val example = sc.parallelize(
  Seq(Level0Struct(
    "level 0 a", 0,
    Seq(
      Level0ArrayStruct("level 0 array a 1", 1),
      Level0ArrayStruct("level 0 array a 2", 2)),
    Level1Struct(
      "level 1 a", 3)))).toDF


scala> example.printSchema
root
 |-- level_0_a: string (nullable = true)
 |-- level_0_b: integer (nullable = false)
 |-- level_0_array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- level_0_array_a: string (nullable = true)
 |    |    |-- level_0_array_b: integer (nullable = false)
 |-- level_0_struct: struct (nullable = true)
 |    |-- level_1_a: string (nullable = true)
 |    |-- level_1_b: integer (nullable = false)


scala> example.withColumn("level_0_struct",
  functions.struct($"level_0_struct.level_1_a", $"level_0_struct.level_1_b",
    $"level_0_b",
    functions.lit("foo").as("foo"))).drop("level_0_b").printSchema
root
 |-- level_0_a: string (nullable = true)
 |-- level_0_array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- level_0_array_a: string (nullable = true)
 |    |    |-- level_0_array_b: integer (nullable = false)
 |-- level_0_struct: struct (nullable = false)
 |    |-- level_1_a: string (nullable = true)
 |    |-- level_1_b: integer (nullable = true)
 |    |-- level_0_b: integer (nullable = false)
 |    |-- foo: string (nullable = false)

However, I can't find a way to reliably deal with the struct inside
level_0_array. If I try to move any of its fields anywhere (including
within that array column), each field becomes an array column itself, and
I don't know how to reassemble ("zip") them back together into a single
array of structs.
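
To make clear what I mean by "zip", here is a rough sketch using plain Scala
collections only (no Spark involved; Element, ElementWithFoo and the value
names are made up for illustration) of the per-row result I am after:

```scala
// Plain Scala collections only; no Spark. All names here are illustrative.
case class Element(a: String, b: Int)                      // one struct inside level_0_array
case class ElementWithFoo(a: String, b: Int, foo: String)  // the struct I would like to end up with

// What each row holds today: a single array of structs.
val arrayOfStructs = Seq(
  Element("level 0 array a 1", 1),
  Element("level 0 array a 2", 2))

// What I get when I select the fields individually: one array per field.
val fieldA = arrayOfStructs.map(_.a)
val fieldB = arrayOfStructs.map(_.b)

// What I would like to do: zip the per-field arrays back together,
// element by element, adding the extra literal to every element.
val zipped = fieldA.zip(fieldB).map { case (a, b) => ElementWithFoo(a, b, "foo") }
```

What I am missing is a way to express that last step on DataFrame columns.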

Say I want to add the same literal "foo", but this time inside level_0_array,
for every element there. The resulting schema is:

scala> example.withColumn("level_0_array",
functions.struct($"level_0_array.level_0_array_a",
$"level_0_array.level_0_array_b",
functions.lit("foo").as("foo"))).printSchema
root
 |-- level_0_a: string (nullable = true)
 |-- level_0_b: integer (nullable = false)
 |-- level_0_array: struct (nullable = false)
 |    |-- level_0_array_a: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- level_0_array_b: array (nullable = true)
 |    |    |-- element: integer (containsNull = true)
 |    |-- foo: string (nullable = false)
 |-- level_0_struct: struct (nullable = true)
 |    |-- level_1_a: string (nullable = true)
 |    |-- level_1_b: integer (nullable = false)

The same problem occurs if I try to rename the fields: they become array
columns.

Is there any way to recursively manipulate repeated columns without
completely breaking their structure into individually repeated fields?

Best
-- 
Samuel
