Cheng Lian created SPARK-15632:
----------------------------------
Summary: Dataset typed filter operation changes query plan schema
Key: SPARK-15632
URL: https://issues.apache.org/jira/browse/SPARK-15632
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Filter operations should never change the query plan schema. However, the
Dataset typed filter operation does introduce schema changes:
{code}
case class A(b: Double, a: String)
val data = Seq(
  "{ 'a': 'foo', 'b': 1, 'c': 'extra' }",
  "{ 'a': 'bar', 'b': 2, 'c': 'extra' }",
  "{ 'a': 'bar', 'c': 'extra' }"
)
val df1 = spark.read.json(sc.parallelize(data))
df1.printSchema()
// root
// |-- a: string (nullable = true)
// |-- b: long (nullable = true)
// |-- c: string (nullable = true)
val ds1 = df1.as[A]
ds1.printSchema()
// root
// |-- a: string (nullable = true)
// |-- b: long (nullable = true)
// |-- c: string (nullable = true)
val ds2 = ds1.filter(_.b > 1) // <- Here comes the trouble maker
ds2.printSchema()
// root                               <- 1. reordered `a` and `b`, and
// |-- b: double (nullable = true)       2. dropped `c`, and
// |-- a: string (nullable = true)       3. up-casted `b` from long to double
val df2 = ds2.toDF()
df2.printSchema()
// root <- (Same as above)
// |-- b: double (nullable = true)
// |-- a: string (nullable = true)
{code}
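For comparison, the untyped {{Column}}-based filter just adds a plain {{Filter}} node on top of {{df1}}'s plan, so it is expected to leave the schema untouched. A minimal sketch, assuming the same spark-shell session as above ({{df3}} is only a local name used here):
{code}
// Untyped filter: a plain Filter over df1's plan, so the schema should still
// report `a`, `b`, `c` with their original types and ordering.
val df3 = df1.filter(df1("b") > 1)
df3.printSchema()
{code}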
This is because we wrap the actual {{Filter}} operator with a
{{SerializeFromObject}}/{{DeserializeToObject}} pair (see the plan-inspection
sketch after the list below). {{DeserializeToObject}} does a bunch of magic
tricks here:
# Field order change
#- {{DeserializeToObject}} resolves the encoder deserializer expressions by
*name*. Thus the field order in the input query plan doesn't matter.
# Field number change
#- Same as above: fields not referred to by the encoder are silently dropped
while resolving deserializer expressions by name.
# Field data type change
#- When generating deserializer expressions, we allow "sane" implicit
coercions (e.g. integer to long, and long to double) by inserting {{UpCast}}
operators. Thus the actual field data types in the input query plan don't
matter either, as long as there are valid implicit coercions.
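To see the wrapping described above, one can inspect the analyzed plan of {{ds2}}. A minimal sketch, continuing the same session (the exact plan output depends on the Spark version, so it is not reproduced here):
{code}
// Either call should reveal the SerializeFromObject / Filter /
// DeserializeToObject sandwich that the typed filter puts around the predicate.
ds2.explain(true)
println(ds2.queryExecution.analyzed.treeString)
{code}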
Actually, even field names may change once [PR
#13269|https://github.com/apache/spark/pull/13269] gets merged, because it
introduces case-insensitive encoder resolution.
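As a side note on item 3 above, the inserted {{UpCast}} only allows widening coercions. A minimal sketch, again in the same session ({{B}} is a hypothetical case class used purely for illustration):
{code}
// `b` is a long column in df1, so asking for the narrower Int is expected to
// fail analysis with an error along the lines of "Cannot up cast `b` from
// bigint to int", whereas widening to Double (case class A above) is accepted.
case class B(b: Int, a: String)
val ds3 = df1.as[B]
ds3.filter(_.b > 1).show()
{code}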