[ 
https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586018#comment-15586018
 ] 

Michael Allman edited comment on SPARK-17983 at 10/18/16 5:23 PM:
------------------------------------------------------------------

cc [~rxin]

I had a feeling there might be some fallout like this. It seems we really need 
to reconcile Hive metastore column names to on-disk column names as part of 
planning.

As part of my work on the partition pruning patch, I implemented reconciliation 
that occurs after partition pruning in optimization so that it only involves 
the partitions in the query plan.

Even though this adds some additional cost to the query planning, this is a big 
improvement over the original behavior which did a scan over every data file in 
the table. However, I believe this can be restricted to the first access of a 
partition in a given Spark session through caching. I believe the reconciled 
table schema could be updated incrementally as additional partitions are 
scanned. The cache would be invalidated through the usual methods.

This follows along the lines of the "re-add partition caching" task I mentioned 
in the beginning of https://github.com/apache/spark/pull/14690.

Thoughts?


was (Author: michael):
cc [~rxin]

I had a feeling there might be some fallout like this. It seems we really need 
to reconcile Hive metastore column names to on-disk column names as part of 
planning.

I think I mentioned, and I actually have, a implemented this kind of 
reconciliation that occurs after partition pruning in optimization so that it 
only involves the partitions in the query plan. Obviously, this is big 
improvement over the original behavior which did a scan over every data file in 
the table.

Even though this adds some additional cost to the query planning, I believe 
this can be restricted to the first access of a partition in a given Spark 
session. The straightforward solution would be to cache the table metadata 
incrementally as partitions are scanned. Subsequent requests for partition 
schema and metadata would come from the cache. The cache would be invalidated 
through the usual methods.

This follows along the lines of the "re-add partition caching" task I mentioned 
in the beginning of https://github.com/apache/spark/pull/14690.

Thoughts?

> Can't filter over mixed case parquet columns of converted Hive tables
> ---------------------------------------------------------------------
>
>                 Key: SPARK-17983
>                 URL: https://issues.apache.org/jira/browse/SPARK-17983
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Eric Liang
>            Priority: Critical
>
> We should probably revive https://github.com/apache/spark/pull/14750 in order 
> to fix this issue and related classes of issues.
> The only other alternatives are (1) reconciling on-disk schemas with 
> metastore schema at planning time, which seems pretty messy, and (2) fixing 
> all the datasources to support case-insensitive matching, which also has 
> issues.
> Reproduction:
> {code}
>   private def setupPartitionedTable(tableName: String, dir: File): Unit = {
>     spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as 
> partCol2").write
>       .partitionBy("partCol1", "partCol2")
>       .mode("overwrite")
>       .parquet(dir.getAbsolutePath)
>     spark.sql(s"""
>       |create external table $tableName (normalCol long)
>       |partitioned by (partCol1 int, partCol2 int)
>       |stored as parquet
>       |location "${dir.getAbsolutePath}"""".stripMargin)
>     spark.sql(s"msck repair table $tableName")
>   }
>   test("filter by mixed case col") {
>     withTable("test") {
>       withTempDir { dir =>
>         setupPartitionedTable("test", dir)
>         val df = spark.sql("select * from test where normalCol = 3")
>         assert(df.count() == 1)
>       }
>     }
>   }
> {code}
> cc [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to