[jira] [Updated] (SPARK-16980) Load only catalog table partition metadata required to answer a query

Michael Allman (JIRA) Tue, 09 Aug 2016 11:18:40 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael Allman updated SPARK-16980:
-----------------------------------
    Description: 
Currently, when a user reads from a partitioned Hive table whose metadata are 
not cached (and for which Hive table conversion is enabled and supported), all 
partition metadata are fetched from the metastore:

https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260

However, if the user's query includes partition pruning predicates then we only 
need the subset of these metadata which satisfy those predicates.

This issue tracks work to modify the current query planning scheme so that 
unnecessary partition metadata are not loaded.

I've prototyped two possible approaches. The first extends 
{{o.a.s.s.c.catalog.ExternalCatalog}} and as such is more generally applicable. 
It requires some new abstractions and refactoring of {{HadoopFsRelation}} and 
{{FileCatalog}}, among others. It places a greater burden on other 
implementations of {{ExternalCatalog}}. Currently the only other implementation 
of {{ExternalCatalog}} is {{InMemoryCatalog}}, and my prototype throws an 
{{UnsupportedOperationException}} on that implementation.

The second prototype is simpler and only touches code in the {{hive}} project. 
Basically, conversion of a partitioned {{MetastoreRelation}} to 
{{HadoopFsRelation}} is deferred to physical planning. During physical 
planning, the partition pruning filters in the query plan are used to identify 
the required partition metadata and a {{HadoopFsRelation}} is built from those. 
The new query plan is then re-injected into the physical planner and proceeds 
as normal for a {{HadoopFsRelation}}.

On the Spark dev mailing list, [~ekhliang] expressed a preference for the 
approach I took in my first POC. (See 
http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html)
 Based on that, I'm going to open a PR with that patch as a starting point for 
an architectural/design review. It will not be a complete patch ready for 
integration into Spark master. Rather, I would like to get early feedback on 
the implementation details so I can shape the PR before committing a large 
amount of time on a finished product. I will open another PR for the second 
approach for comparison if requested.

  was:
Currently, when a user reads from a partitioned Hive table whose metadata are 
not cached (and for which Hive table conversion is enabled and supported), all 
partition metadata are fetched from the metastore:

https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260

However, if the user's query includes partition pruning predicates then we only 
need the subset of these metadata which satisfy those predicates.

This issue tracks work to modify the current query planning scheme so that 
unnecessary partition metadata are not loaded.

I've prototyped two possible approaches. The first extends 
`org.apache.spark.sql.catalyst.catalog.ExternalCatalog` and as such is more 
generally applicable. It requires some new abstractions and refactoring of 
`HadoopFsRelation` and `FileCatalog`, among others. It places a greater burden 
on other implementations of `ExternalCatalog`. Currently the only other 
implementation of `ExternalCatalog` is `InMemoryCatalog`, and my prototype 
throws an `UnsupportedOperationException` on that implementation.

The second prototype is simpler and only touches code in the `hive` project. 
Basically, conversion of a partitioned `MetastoreRelation` to 
`HadoopFsRelation` is deferred to physical planning. During physical planning, 
the partition pruning filters in the query plan are used to identify the 
required partition metadata and a `HadoopFsRelation` is built from those. The 
new query plan is then re-injected into the physical planner and proceeds as 
normal for a `HadoopFsRelation`.

On the Spark dev mailing list, [~ekhliang] expressed a preference for the 
approach I took in my first POC. (See 
http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html)
 Based on that, I'm going to open a PR with that patch as a starting point for 
an architectural/design review. It will not be a complete patch ready for 
integration into Spark master. Rather, I would like to get early feedback on 
the implementation details so I can shape the PR before committing a large 
amount of time on a finished product. I will open another PR for the second 
approach for comparison if requested.


> Load only catalog table partition metadata required to answer a query
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16980
>                 URL: https://issues.apache.org/jira/browse/SPARK-16980
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Michael Allman
>
> Currently, when a user reads from a partitioned Hive table whose metadata are 
> not cached (and for which Hive table conversion is enabled and supported), 
> all partition metadata are fetched from the metastore:
> https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260
> However, if the user's query includes partition pruning predicates then we 
> only need the subset of these metadata which satisfy those predicates.
> This issue tracks work to modify the current query planning scheme so that 
> unnecessary partition metadata are not loaded.
> I've prototyped two possible approaches. The first extends 
> {{o.a.s.s.c.catalog.ExternalCatalog}} and as such is more generally 
> applicable. It requires some new abstractions and refactoring of 
> {{HadoopFsRelation}} and {{FileCatalog}}, among others. It places a greater 
> burden on other implementations of {{ExternalCatalog}}. Currently the only 
> other implementation of {{ExternalCatalog}} is {{InMemoryCatalog}}, and my 
> prototype throws an {{UnsupportedOperationException}} on that implementation.
> The second prototype is simpler and only touches code in the {{hive}} 
> project. Basically, conversion of a partitioned {{MetastoreRelation}} to 
> {{HadoopFsRelation}} is deferred to physical planning. During physical 
> planning, the partition pruning filters in the query plan are used to 
> identify the required partition metadata and a {{HadoopFsRelation}} is built 
> from those. The new query plan is then re-injected into the physical planner 
> and proceeds as normal for a {{HadoopFsRelation}}.
> On the Spark dev mailing list, [~ekhliang] expressed a preference for the 
> approach I took in my first POC. (See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html)
>  Based on that, I'm going to open a PR with that patch as a starting point 
> for an architectural/design review. It will not be a complete patch ready for 
> integration into Spark master. Rather, I would like to get early feedback on 
> the implementation details so I can shape the PR before committing a large 
> amount of time on a finished product. I will open another PR for the second 
> approach for comparison if requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16980) Load only catalog table partition metadata required to answer a query

Reply via email to