I've been looking at performance differences between Spark SQL queries against a single Parquet table vs. a unionAll of two tables. It's a significant difference, something like 5 to 10x.
Is there a reason in general not to push projections and predicates down into the individual ParquetTableScans in a union? Here's an example of what I'm talking about:

scala> p.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- phones: array (nullable = true)
 |    |-- element: string (containsNull = true)

scala> p2.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- phones: array (nullable = true)
 |    |-- element: string (containsNull = true)

scala> val b = p.unionAll(p2)

// single table, pushdown
scala> p.where('age < 40).select('name)
res36: org.apache.spark.sql.SchemaRDD =
SchemaRDD[97] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 ParquetTableScan [name#3,age#4], (ParquetRelation /var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), [(age#4 < 40)]

// union of 2 tables, no pushdown
scala> b.where('age < 40).select('name)
res37: org.apache.spark.sql.SchemaRDD =
SchemaRDD[99] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 Filter (age#4 < 40)
  Union [ParquetTableScan [name#3,age#4,phones#5], (ParquetRelation /var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), []
   ,ParquetTableScan [name#0,age#1,phones#2], (ParquetRelation /var/tmp/people2, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), []
   ]
   ParquetTableScan [name#3,age#4,phones#5], (ParquetRelation /var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql...
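
In the meantime, the workaround I've been using is to apply the filter and projection to each table by hand before unioning, so each side is planned as its own ParquetTableScan with the predicate and column pruning attached. This is just a sketch against the same p and p2 SchemaRDDs from the example above; I haven't checked the resulting plan on every Spark version:

// push the predicate and projection below the union manually,
// so each scan only reads [name, age] and carries the (age < 40) predicate
val pushedDown = p.where('age < 40).select('name)
  .unionAll(p2.where('age < 40).select('name))

Evaluating pushedDown in the REPL (as with res36/res37 above) should then show two ParquetTableScans, each with the filter in its predicate list, instead of a single Filter sitting on top of the Union.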