[ 
https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304873#comment-15304873
 ] 

Cheng Lian commented on SPARK-15632:
------------------------------------

cc [~cloud_fan] [~marmbrus]

> Dataset typed filter operation changes query plan schema
> --------------------------------------------------------
>
>                 Key: SPARK-15632
>                 URL: https://issues.apache.org/jira/browse/SPARK-15632
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>
> Filter operations should never change the query plan schema. However, the 
> Dataset typed filter operation does introduce a schema change:
> {code}
> case class A(b: Double, a: String)
> val data = Seq(
>   "{ 'a': 'foo', 'b': 1, 'c': 'extra' }",
>   "{ 'a': 'bar', 'b': 2, 'c': 'extra' }",
>   "{ 'a': 'bar', 'c': 'extra' }"
> )
> val df1 = spark.read.json(sc.parallelize(data))
> df1.printSchema()
> // root
> //  |-- a: string (nullable = true)
> //  |-- b: long (nullable = true)
> //  |-- c: string (nullable = true)
> val ds1 = df1.as[A]
> ds1.printSchema()
> // root
> //  |-- a: string (nullable = true)
> //  |-- b: long (nullable = true)
> //  |-- c: string (nullable = true)
> val ds2 = ds1.filter(_.b > 1)    // <- Here comes the troublemaker
> ds2.printSchema()
> // root                             <- 1. reordered `a` and `b`, and
> //  |-- b: double (nullable = true)    2. dropped `c`, and
> //  |-- a: string (nullable = true)    3. up-casted `b` from long to double
> val df2 = ds2.toDF()
> df2.printSchema()
> // root                             <- (Same as above)
> //  |-- b: double (nullable = true)
> //  |-- a: string (nullable = true)
> {code}
> This is because we wrap the actual {{Filter}} operator in a 
> {{SerializeFromObject}}/{{DeserializeToObject}} pair. 
> {{DeserializeToObject}} does a bunch of magic tricks here:
> # Field order change
> #- {{DeserializeToObject}} resolves the encoder deserializer expressions by 
> *name*. Thus the field order in the input query plan doesn't matter.
> # Field number change
> #- Same as above: fields not referenced by the encoder are silently dropped 
> while resolving deserializer expressions by name.
> # Field data type change
> #- When generating deserializer expressions, we allow "sane" implicit 
> coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} 
> operators. Thus the actual field data types in the input query plan don't 
> matter either, as long as valid implicit coercions exist.
> Actually, even field names may change once [PR 
> #13269|https://github.com/apache/spark/pull/13269] gets merged, because it 
> introduces case-insensitive encoder resolution.
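> For comparison, the untyped {{Column}}-based filter leaves the schema 
> untouched (sketch continuing the example above, assuming 
> {{spark.implicits._}} is imported for the {{$}} syntax):
> {code}
> val df3 = df1.filter($"b" > 1)
> df3.printSchema()
> // root
> //  |-- a: string (nullable = true)
> //  |-- b: long (nullable = true)
> //  |-- c: string (nullable = true)
> {code}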



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
