Michel Lemay created SPARK-19980: ------------------------------------ Summary: Basic Dataset transformation on POJOs does not preserves nulls. Key: SPARK-19980 URL: https://issues.apache.org/jira/browse/SPARK-19980 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Michel Lemay
Applying an identity map transformation on a statically typed Dataset with a POJO produces an unexpected result. Given POJOs: {code} public class Stuff implements Serializable { private String name; public void setName(String name) { this.name = name; } public String getName() { return name; } } public class Outer implements Serializable { private String name; private Stuff stuff; public void setName(String name) { this.name = name; } public String getName() { return name; } public void setStuff(Stuff stuff) { this.stuff = stuff; } public Stuff getStuff() { return stuff; } } {code} And test code: {code} val encoder = Encoders.bean(classOf[Outer]) val schema = encoder.schema schema.printTreeString val df = spark.read.schema(schema).json("d:\\stuff.json").as[Outer](encoder) df.show() df.map(x => x)(encoder).show() {code} Produces the result: {code} scala> val encoder = Encoders.bean(classOf[Outer]) encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, stuff[0]: struct<name:string>] scala> val schema = encoder.schema schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true)) scala> schema.printTreeString root |-- name: string (nullable = true) |-- stuff: struct (nullable = true) | |-- name: string (nullable = true) scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder) df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: struct<name: string>] scala> df.show() +----+-----+ |name|stuff| +----+-----+ | v1| null| +----+-----+ scala> df.map(x => x)(encoder).show() +----+------+ |name| stuff| +----+------+ | v1|[null]| +----+------+ {code} After identity transformation, `stuff` becomes an object with null values inside it instead of staying null itself. Doing the same with case classes preserves the nulls: {code} scala> case class ScalaStuff(name: String) defined class ScalaStuff scala> case class ScalaOuter(name: String, stuff: ScalaStuff) defined class ScalaOuter scala> val encoder2 = Encoders.product[ScalaOuter] encoder2: org.apache.spark.sql.Encoder[ScalaOuter] = class[name[0]: string, stuff[0]: struct<name:string>] scala> val schema2 = encoder2.schema schema2: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true)) scala> schema2.printTreeString root |-- name: string (nullable = true) |-- stuff: struct (nullable = true) | |-- name: string (nullable = true) scala> scala> val df2 = spark.read.schema(schema2).json("stuff.json").as[ScalaOuter] df2: org.apache.spark.sql.Dataset[ScalaOuter] = [name: string, stuff: struct<name: string>] scala> df2.show() +----+-----+ |name|stuff| +----+-----+ | v1| null| +----+-----+ scala> df2.map(x => x).show() +----+-----+ |name|stuff| +----+-----+ | v1| null| +----+-----+ {code} stuff.json: {code} {"name":"v1", "stuff":null } {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org