I've been looking at performance differences between spark sql queries
against single parquet tables, vs a unionAll of two tables.  It's a
significant difference, like 5 to 10x

Is there a reason in general not to push projections and predicates down
into the individual ParquetTableScans in a union?

Here's an example of what I'm talking about:


scala> p.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- phones: array (nullable = true)
 |    |-- element: string (containsNull = true)


scala> p2.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- phones: array (nullable = true)
 |    |-- element: string (containsNull = true)


scala> val b = p.unionAll(p2)


// single table, pushdown
scala> p.where('age < 40).select('name)
res36: org.apache.spark.sql.SchemaRDD =
SchemaRDD[97] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 ParquetTableScan [name#3,age#4], (ParquetRelation /var/tmp/people,
Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), [(age#4 <
40)]


// union of 2 tables, no pushdown
scala> b.where('age < 40).select('name)
res37: org.apache.spark.sql.SchemaRDD =
SchemaRDD[99] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 Filter (age#4 < 40)
  Union [ParquetTableScan [name#3,age#4,phones#5], (ParquetRelation
/var/tmp/people, Some(Configuration: core-default.xml, core-site.xml,
mapred-default.xml, mapred-site.xml),
org.apache.spark.sql.SQLContext@6d7e79f6, []), []
,ParquetTableScan [name#0,age#1,phones#2], (ParquetRelation
/var/tmp/people2, Some(Configuration: core-default.xml, core-site.xml,
mapred-default.xml, mapred-site.xml),
org.apache.spark.sql.SQLContext@6d7e79f6, []), []
]
   ParquetTableScan [name#3,age#4,phones#5], (ParquetRelation
/var/tmp/people, Some(Configuration: core-default.xml, core-site.xml,
mapred-default.xml, mapred-site.xml), org.apache.spark.sql...

Reply via email to