Re: Schema Evolution for nested Dataset[T]
Hi Michael,

Thank you for the suggestions. I am wondering how I can make `withColumn` handle a nested structure. For example, below is my code to generate the data. I basically add the `age` field to `Person2`, which is nested in an Array inside `Course2`. Then I want to fill in 0 for `age` where `age` is null.

case class Person1(name: String)
case class Person2(name: String, age: Int)
case class Course1(id: Int, students: Array[Person1])
case class Course2(id: Int, students: Array[Person2])

Seq(Course1(10, Array(Person1("a"), Person1("b")))).toDF.write.parquet("data1")
Seq(Course2(20, Array(Person2("c", 20), Person2("d", 10)))).toDF.write.parquet("data2")

val allData = spark.read.option("mergeSchema", "true").parquet("data1", "data2")
allData.show

+---+--------------------+
| id|            students|
+---+--------------------+
| 20|    [[c,20], [d,10]]|
| 10|[[a,null], [b,null]]|
+---+--------------------+

My first try:

allData.withColumn("students.age", coalesce($"students.age", lit(0)))

It returns the exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'coalesce(`students`.`age`, 0)' due to data type mismatch: input to function coalesce should all be the same type, but it's [array<int>, int];;

My second try:

allData.withColumn("students.age", coalesce($"students.age", array(lit(0), lit(0)))).show

+---+--------------------+------------+
| id|            students|students.age|
+---+--------------------+------------+
| 20|    [[c,20], [d,10]]|    [20, 10]|
| 10|[[a,null], [b,null]]|[null, null]|
+---+--------------------+------------+

It creates a new top-level column named "students.age" instead of imputing the `age` values nested in `students`.

Thank you very much in advance.

Mike

On Mon, May 1, 2017 at 10:31 AM, Michael Armbrust wrote:

> Oh, and if you want a default other than null:
>
> import org.apache.spark.sql.functions._
> df.withColumn("address", coalesce($"address", lit()))
>
> On Mon, May 1, 2017 at 10:29 AM, Michael Armbrust wrote:
>
>> The following should work:
>>
>> val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema
>> spark.read.schema(schema).parquet("data.parquet").as[Course]
>>
>> Note this will only work for nullable fields (i.e. if you add a primitive
>> like Int you need to make it an Option[Int])
>>
>> On Sun, Apr 30, 2017 at 9:12 PM, Mike Wheeler <
>> rotationsymmetr...@gmail.com> wrote:
>>
>>> Hi Spark Users,
>>>
>>> Suppose I have some data (stored in parquet for example) generated as
>>> below:
>>>
>>> package com.company.entity.old
>>> case class Course(id: Int, students: List[Student])
>>> case class Student(name: String)
>>>
>>> Then usually I can access the data by
>>>
>>> spark.read.parquet("data.parquet").as[Course]
>>>
>>> Now I want to add a new field `address` to Student:
>>>
>>> package com.company.entity.new
>>> case class Course(id: Int, students: List[Student])
>>> case class Student(name: String, address: String)
>>>
>>> Then running `spark.read.parquet("data.parquet").as[Course]` on data
>>> generated by the old entity/schema will fail because `address` is missing.
>>>
>>> In this case, what is the best practice to read data generated with the
>>> old entity/schema into the new entity/schema, with the missing field set
>>> to some default value? I know I can manually write a function to do the
>>> transformation from the old to the new, but it is kind of tedious. Any
>>> automatic methods?
>>>
>>> Thanks,
>>>
>>> Mike
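A dotted name in `withColumn` always creates a new top-level column, so the nested array has to be rebuilt explicitly. One possible sketch, not from this thread: model the merged schema with the nullable `age` as an `Option[Int]` and rewrite each nested record with a plain function, which could then be applied via `Dataset.map` on `allData.as[CourseOpt]`. The `PersonOpt`/`CourseOpt` names are hypothetical.

```scala
// Hypothetical sketch: case classes mirroring the merged schema, with
// Option[Int] for the nullable `age` column produced by mergeSchema.
case class PersonOpt(name: String, age: Option[Int])
case class CourseOpt(id: Int, students: Array[PersonOpt])

// Rebuild the nested array, defaulting a missing age to 0.
def fillAge(c: CourseOpt): CourseOpt =
  c.copy(students = c.students.map(p => p.copy(age = Some(p.age.getOrElse(0)))))

// With Spark in scope this would be applied as:
//   import spark.implicits._
//   val filled = allData.as[CourseOpt].map(fillAge)
```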
Re: Schema Evolution for nested Dataset[T]
Oh, and if you want a default other than null:

import org.apache.spark.sql.functions._
df.withColumn("address", coalesce($"address", lit()))

On Mon, May 1, 2017 at 10:29 AM, Michael Armbrust wrote:

> The following should work:
>
> val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema
> spark.read.schema(schema).parquet("data.parquet").as[Course]
>
> Note this will only work for nullable fields (i.e. if you add a primitive
> like Int you need to make it an Option[Int])
>
> On Sun, Apr 30, 2017 at 9:12 PM, Mike Wheeler <
> rotationsymmetr...@gmail.com> wrote:
>
>> Hi Spark Users,
>>
>> Suppose I have some data (stored in parquet for example) generated as
>> below:
>>
>> package com.company.entity.old
>> case class Course(id: Int, students: List[Student])
>> case class Student(name: String)
>>
>> Then usually I can access the data by
>>
>> spark.read.parquet("data.parquet").as[Course]
>>
>> Now I want to add a new field `address` to Student:
>>
>> package com.company.entity.new
>> case class Course(id: Int, students: List[Student])
>> case class Student(name: String, address: String)
>>
>> Then running `spark.read.parquet("data.parquet").as[Course]` on data
>> generated by the old entity/schema will fail because `address` is missing.
>>
>> In this case, what is the best practice to read data generated with the
>> old entity/schema into the new entity/schema, with the missing field set
>> to some default value? I know I can manually write a function to do the
>> transformation from the old to the new, but it is kind of tedious. Any
>> automatic methods?
>>
>> Thanks,
>>
>> Mike
Re: Schema Evolution for nested Dataset[T]
The following should work:

val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema
spark.read.schema(schema).parquet("data.parquet").as[Course]

Note this will only work for nullable fields (i.e. if you add a primitive like Int you need to make it an Option[Int])

On Sun, Apr 30, 2017 at 9:12 PM, Mike Wheeler wrote:

> Hi Spark Users,
>
> Suppose I have some data (stored in parquet for example) generated as
> below:
>
> package com.company.entity.old
> case class Course(id: Int, students: List[Student])
> case class Student(name: String)
>
> Then usually I can access the data by
>
> spark.read.parquet("data.parquet").as[Course]
>
> Now I want to add a new field `address` to Student:
>
> package com.company.entity.new
> case class Course(id: Int, students: List[Student])
> case class Student(name: String, address: String)
>
> Then running `spark.read.parquet("data.parquet").as[Course]` on data
> generated by the old entity/schema will fail because `address` is missing.
>
> In this case, what is the best practice to read data generated with the
> old entity/schema into the new entity/schema, with the missing field set
> to some default value? I know I can manually write a function to do the
> transformation from the old to the new, but it is kind of tedious. Any
> automatic methods?
>
> Thanks,
>
> Mike
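Spelled out end to end, under the thread's assumptions (the new `Course`/`Student` classes with `address` widened to `Option[String]`, a local `SparkSession`, and `Encoders.product` as an alternative spelling of the implicit-encoder lookup), the pattern might look like this sketch:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// New schema: the added field is an Option so old files decode it as null/None.
case class Student(name: String, address: Option[String])
case class Course(id: Int, students: List[Student])

object ReadOldData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    // Derive the target schema from the case classes rather than the files,
    // so the column missing from old files is simply read as null.
    val schema = Encoders.product[Course].schema
    val courses = spark.read.schema(schema).parquet("data.parquet").as[Course]
    courses.show()
  }
}
```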
Schema Evolution for nested Dataset[T]
Hi Spark Users,

Suppose I have some data (stored in parquet for example) generated as below:

package com.company.entity.old
case class Course(id: Int, students: List[Student])
case class Student(name: String)

Then usually I can access the data by

spark.read.parquet("data.parquet").as[Course]

Now I want to add a new field `address` to Student:

package com.company.entity.new
case class Course(id: Int, students: List[Student])
case class Student(name: String, address: String)

Then running `spark.read.parquet("data.parquet").as[Course]` on data generated by the old entity/schema will fail because `address` is missing.

In this case, what is the best practice to read data generated with the old entity/schema into the new entity/schema, with the missing field set to some default value? I know I can manually write a function to do the transformation from the old to the new, but it is kind of tedious. Any automatic methods?

Thanks,

Mike

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
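The manual transformation mentioned at the end can be sketched in plain Scala. This is only an illustration: `oldv`/`newv` stand in for the two packages from the message (`new` is a Scala keyword), and the empty-string default is an assumption.

```scala
// Old and new entities, mirroring the two packages in the message.
object oldv {
  case class Student(name: String)
  case class Course(id: Int, students: List[Student])
}
object newv {
  case class Student(name: String, address: String)
  case class Course(id: Int, students: List[Student])
}

// Upgrade one record, filling the missing field ("" is an assumed default).
def upgrade(c: oldv.Course): newv.Course =
  newv.Course(c.id, c.students.map(s => newv.Student(s.name, "")))

// With Spark this would run as:
//   spark.read.parquet("data.parquet").as[oldv.Course].map(upgrade)
```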