Michael Allman commented on SPARK-17983:

cc [~rxin]

I had a feeling there might be some fallout like this. It seems we really need 
to reconcile Hive metastore column names to on-disk column names as part of 

I think I mentioned, and I actually have, a implemented this kind of 
reconciliation that occurs after partition pruning in optimization so that it 
only involves the partitions in the query plan. Obviously, this is big 
improvement over the original behavior which did a scan over every data file in 
the table.

Even though this adds some additional cost to the query planning, I believe 
this can be restricted to the first access of a partition in a given Spark 
session. The straightforward solution would be to cache the table metadata 
incrementally as partitions are scanned. Subsequent requests for partition 
schema and metadata would come from the cache. The cache would be invalidated 
through the usual methods.

This follows along the lines of the "re-add partition caching" task I mentioned 
in the beginning of https://github.com/apache/spark/pull/14690.


> Can't filter over mixed case parquet columns of converted Hive tables
> ---------------------------------------------------------------------
>                 Key: SPARK-17983
>                 URL: https://issues.apache.org/jira/browse/SPARK-17983
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Eric Liang
>            Priority: Critical
> We should probably revive https://github.com/apache/spark/pull/14750 in order 
> to fix this issue and related classes of issues.
> The only other alternatives are (1) reconciling on-disk schemas with 
> metastore schema at planning time, which seems pretty messy, and (2) fixing 
> all the datasources to support case-insensitive matching, which also has 
> issues.
> Reproduction:
> {code}
>   private def setupPartitionedTable(tableName: String, dir: File): Unit = {
>     spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as 
> partCol2").write
>       .partitionBy("partCol1", "partCol2")
>       .mode("overwrite")
>       .parquet(dir.getAbsolutePath)
>     spark.sql(s"""
>       |create external table $tableName (normalCol long)
>       |partitioned by (partCol1 int, partCol2 int)
>       |stored as parquet
>       |location "${dir.getAbsolutePath}"""".stripMargin)
>     spark.sql(s"msck repair table $tableName")
>   }
>   test("filter by mixed case col") {
>     withTable("test") {
>       withTempDir { dir =>
>         setupPartitionedTable("test", dir)
>         val df = spark.sql("select * from test where normalCol = 3")
>         assert(df.count() == 1)
>       }
>     }
>   }
> {code}
> cc [~cloud_fan]

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to