[jira] [Comment Edited] (SPARK-15691) Refactor and improve Hive support

Xiao Li (JIRA) Sat, 04 Jun 2016 00:02:18 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315357#comment-15315357
 ]


Xiao Li edited comment on SPARK-15691 at 6/4/16 7:01 AM:
---------------------------------------------------------

Finally, {{HiveMetastoreCatalog.scala}} has been cleaned up in my local 
branch. It now contains only a single class for Data Source Table caching, with 
a few public APIs. The file no longer depends on any Hive-related entities, so 
we can decide whether to move it to sql/core or keep it in sql/hive. It is 
reduced to 90 LOC. We need a new name! :)

The four Hive-specific {{Analyzer}} rules have been moved to 
{{HiveStrategies.scala}}. This mirrors {{DataSourceStrategy.scala}}, which 
has both {{Analyzer}} rules and {{SparkPlanner}} strategies. Two related rules, 
{{ParquetConversions}} and {{OrcConversions}}, are combined into a single rule, 
{{ConvertMetastoreTables}}. 

I also consolidated duplicated code and refactored a few pieces. 

I will write and upload a design doc tomorrow that documents the changes and 
the reasons behind them, for further review. Thanks!


was (Author: smilegator):
Finally, {{HiveMetastoreCatalog}} has been cleaned in my private local branch. 
Now, it becomes a pure Data Source Table cache. The LOC is reduced to 90. We 
need a new name!

The four Hive-specific {{Analyzer}} rules are moved to 
{{HiveStrategies.scala}}. This is just like {{DataSourceStrategy.scala}}, which 
has both {{Analyzer}} rules and {{SparkPlanner}} strategies.

Also, combined the duplicate code and refactored some code. 

Will write and upload a design doc tomorrow to document the change details for 
further review. Thanks!

> Refactor and improve Hive support
> ---------------------------------
>
>                 Key: SPARK-15691
>                 URL: https://issues.apache.org/jira/browse/SPARK-15691
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Move the Hive-specific catalog logic into HiveExternalCatalog.
>   -- Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
>   -- Move using properties to store data source options into 
> HiveExternalCatalog.
>   -- Potentially more.
> - Remove the Hive-specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

