Cheng Lian created SPARK-15632:
----------------------------------

             Summary: Dataset typed filter operation changes query plan schema
                 Key: SPARK-15632
                 URL: https://issues.apache.org/jira/browse/SPARK-15632
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Cheng Lian



Filter operations should never change the query plan schema. However, the Dataset 
typed filter operation does introduce a schema change:

{code}
case class A(b: Double, a: String)

val data = Seq(
  "{ 'a': 'foo', 'b': 1, 'c': 'extra' }",
  "{ 'a': 'bar', 'b': 2, 'c': 'extra' }",
  "{ 'a': 'bar', 'c': 'extra' }"
)

val df1 = spark.read.json(sc.parallelize(data))
df1.printSchema()
// root
//  |-- a: string (nullable = true)
//  |-- b: long (nullable = true)
//  |-- c: string (nullable = true)

val ds1 = df1.as[A]
ds1.printSchema()
// root
//  |-- a: string (nullable = true)
//  |-- b: long (nullable = true)
//  |-- c: string (nullable = true)

val ds2 = ds1.filter(_.b > 1)    // <- Here comes the troublemaker
ds2.printSchema()
// root                             <- 1. reordered `a` and `b`, and
//  |-- b: double (nullable = true)    2. dropped `c`, and
//  |-- a: string (nullable = true)    3. up-casted `b` from long to double

val df2 = ds2.toDF()
df2.printSchema()
// root                             <- (Same as above)
//  |-- b: double (nullable = true)
//  |-- a: string (nullable = true)
{code}

This is because we wrap the actual {{Filter}} operator with a 
{{SerializeFromObject}}/{{DeserializeToObject}} pair.
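The wrapping can be observed directly from the extended plan output; a minimal check, continuing the snippet above:

{code}
// `explain(true)` prints the parsed, analyzed, optimized, and physical plans.
// The analyzed plan is expected to show the typed filter sandwiched between a
// DeserializeToObject and a SerializeFromObject node, which is where the
// schema change comes from.
ds2.explain(true)
{code}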

{{DeserializeToObject}} performs several implicit transformations here:

# Field order change
#- {{DeserializeToObject}} resolves the encoder deserializer expressions by 
*name*, so the field order in the input query plan doesn't matter.
# Field number change
#- For the same reason, fields not referenced by the encoder are silently dropped 
while resolving deserializer expressions by name.
# Field data type change
#- When generating deserializer expressions, we allow "sane" implicit coercions 
(e.g. integer to long, and long to double) by inserting {{UpCast}} operators. 
Thus the actual field data types in the input query plan don't matter either, 
as long as there are valid implicit coercions.
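
By contrast, the untyped {{Column}}-based filter never goes through the encoder, 
so it leaves the input schema untouched. A minimal sketch (assuming 
{{spark.implicits._}} is in scope, as in the spark-shell, and reusing {{df1}} 
from above):

{code}
// Untyped filter: no SerializeFromObject/DeserializeToObject pair is inserted,
// so field order, field count, and field types are all preserved.
val df3 = df1.filter($"b" > 1)
df3.printSchema()
// root
//  |-- a: string (nullable = true)
//  |-- b: long (nullable = true)
//  |-- c: string (nullable = true)
{code}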

Actually, even field names may change once [PR 
#13269|https://github.com/apache/spark/pull/13269] gets merged, because it 
introduces case-insensitive encoder resolution.


