[ https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304873#comment-15304873 ]
Cheng Lian commented on SPARK-15632: ------------------------------------ cc [~cloud_fan] [~marmbrus] > Dataset typed filter operation changes query plan schema > -------------------------------------------------------- > > Key: SPARK-15632 > URL: https://issues.apache.org/jira/browse/SPARK-15632 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.0.0 > Reporter: Cheng Lian > > Dataset typed filter operation changes query plan schema > Filter operations should never changes query plan schema. However, Dataset > typed filter operation does introduce schema change: > {code} > case class A(b: Double, a: String) > val data = Seq( > "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", > "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", > "{ 'a': 'bar', 'c': 'extra' }" > ) > val df1 = spark.read.json(sc.parallelize(data)) > df1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds1 = df1.as[A] > ds1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds2 = ds1.filter(_.b > 1) // <- Here comes the trouble maker > ds2.printSchema() > // root <- 1. reordered `a` and `b`, and > // |-- b: double (nullable = true) 2. dropped `c`, and > // |-- a: string (nullable = true) 3. up-casted `b` from long to double > val df2 = ds2.toDF() > df2.printSchema() > // root <- (Same as above) > // |-- b: double (nullable = true) > // |-- a: string (nullable = true) > {code} > This is becase we wraps the actual {{Filter}} operator with a > {{SerializeFromObject}}/{{DeserializeToObject}} pair. > {{DeserializeToObject}} does a bunch of magic tricks here: > # Field order change > #- {{DeserializeToObject}} resolves the encoder deserializer expression by > **name**. Thus field order in input query plan doesn't matter. > # Field number change > #- Same as above, fields not referred by the encoder are silently dropped > while resolving deserializer expressions by name. > # Field data type change > #- When generating deserializer expressions, we allows "sane" implicit > coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} > operators. Thus actual field data types in input query plan don't matter > either as long as there are valid implicit coercions. > Actually, even field names may change once [PR > #13269|https://github.com/apache/spark/pull/13269] gets merged, because it > introduces case-insensitive encoder resolution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org