Reynold Xin resolved SPARK-16980.
       Resolution: Fixed
    Fix Version/s: 2.1.0

> Load only catalog table partition metadata required to answer a query
> ---------------------------------------------------------------------
>                 Key: SPARK-16980
>                 URL: https://issues.apache.org/jira/browse/SPARK-16980
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Michael Allman
>            Assignee: Michael Allman
>             Fix For: 2.1.0
> Currently, when a user reads from a partitioned Hive table whose metadata are 
> not cached (and for which Hive table conversion is enabled and supported), 
> all partition metadata are fetched from the metastore:
> https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260
> However, if the user's query includes partition pruning predicates then we 
> only need the subset of these metadata which satisfy those predicates.
> This issue tracks work to modify the current query planning scheme so that 
> unnecessary partition metadata are not loaded.
> I've prototyped two possible approaches. The first extends 
> {{o.a.s.s.c.catalog.ExternalCatalog}} and as such is more generally 
> applicable. It requires some new abstractions and refactoring of 
> {{HadoopFsRelation}} and {{FileCatalog}}, among others. It places a greater 
> burden on other implementations of {{ExternalCatalog}}. Currently the only 
> other implementation of {{ExternalCatalog}} is {{InMemoryCatalog}}, and my 
> prototype throws an {{UnsupportedOperationException}} on that implementation.
> The second prototype is simpler and only touches code in the {{hive}} 
> project. Basically, conversion of a partitioned {{MetastoreRelation}} to 
> {{HadoopFsRelation}} is deferred to physical planning. During physical 
> planning, the partition pruning filters in the query plan are used to 
> identify the required partition metadata and a {{HadoopFsRelation}} is built 
> from those. The new query plan is then re-injected into the physical planner 
> and proceeds as normal for a {{HadoopFsRelation}}.
> On the Spark dev mailing list, [~ekhliang] expressed a preference for the 
> approach I took in my first POC. (See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html)
>  Based on that, I'm going to open a PR with that patch as a starting point 
> for an architectural/design review. It will not be a complete patch ready for 
> integration into Spark master. Rather, I would like to get early feedback on 
> the implementation details so I can shape the PR before committing a large 
> amount of time on a finished product. I will open another PR for the second 
> approach for comparison if requested.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to