[jira] [Commented] (HIVE-17159) Make metastore a separately releasable module

Alan Gates (JIRA) Tue, 25 Jul 2017 09:20:48 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100299#comment-16100299
 ]


Alan Gates commented on HIVE-17159:
-----------------------------------

Thoughts on how to proceed:

We should do this in master, not in a branch.  This avoids having to port 
patches when people make changes.  It does expose everyone to the moving parts, 
but part of what we want to prove out is whether or not this is painful to 
other developers, so this is actually a feature.

Not everything in the current metastore module can be released separately as 
is.  There are various reasons for this, detailed below.  But I believe this 
means the easiest way to proceed is to create a new maven module 
"metastore-standalone" or something, and move the files we want into that.  
Putting the files in a new maven module also allows us to easily confirm the 
separate metastore's dependencies.

Despite physically moving the files we should not change the package names now. 
 Though we will probably want to change the package names if this becomes its 
own TLP, doing so now would be disruptive with no benefit.

The goal is to not have Hive dependencies in the separate metastore, with the 
exception of the storage-api.  I propose keeping the storage-api because it has 
the SARGs in it and I believe these will be useful in the future as a way to 
describe partition filter expressions.  This does mean that some code has to be 
copied or detangled.

In looking through this and doing some POC work, I found the following areas 
that need to be copied, detangled, or left behind:
* The conf file.  A similar class to HiveConf can be built that has the 
relevant ConfVars.  This class will need to extend Hadoop's Configuration class 
since that is a common base class with HiveConf and HiveConf cannot be directly 
subclassed.  It will also need to continue to support Hive configuration values.
* I propose we leave the HBase metastore in the current metastore module, or 
remove it altogether.  As far as I know no one is using it and no active work 
is going on in that space.
* I propose we leave behind the FileMetadataHandler.  Currently this is only 
implemented in the HBase metastore.  This means we would need to modify 
PartitionExpressionProxy to split out the calls that deal with file metadata 
into a sub-interface.
* The shims.  If we assume this will never be used against Hadoop 1 then we can 
drop most of the shims.  Mostly the metastore makes HDFS calls and uses the 
thrift security classes.  Neither of these change as far as I could tell 
between Hadoop 2 and 3.  This means that rather than copy the shims we can 
remove them and make direct calls to the Hadoop methods.  We will need to copy 
the HadoopThriftAuthBridge code, but only the Hadoop 2 version of it.
* TypeInfo.  This is in the serde module.  It cannot be easily moved out into 
storage-api because there is a circular dependency between TypeInfo, SerDe, and 
ObjectInspector.  The metastore only uses three things from TypeInfo: 1) the 
type names; 2) type groupings (e.g. string, char, varchar are all string 
types); and 3) supported casts when altering column types.  By copying just 
this logic into the metastore I was able to remove the dependency on TypeInfo.  
Ideally we will find a place where these type definitions can live and be 
shared by the metastore, Hive, file formats like ORC and Parquet, and any other 
engines that wish to use them.
* Metrics.  Assuming we choose not to support the legacy Hive metrics from Hive 
1 we can use codahale metrics directly and not copy over Hive's metrics code.  
This includes copying the JvmPauseMonitor.
* Handling tables with the schema in the file.  Currently this is done using a 
SerDe.  Given that we want this to be independent of the serde module that 
won't work.  I propose creating an interface that systems can implement to read 
schemas from storage.  Hive can continue to use its SerDe to implement this 
interface.
* There are a few features that today work by calling back into ql, e.g. the 
PartitionExpressionProxy and the compactor threads.  Neither of these appears 
easy to resolve.  PartitionExpressionProxy works directly on the Hive AST.  
This should be converted to use SARGs.  I haven't scoped how much work this is. 
 We will still need to support the old AST methods as well for backwards 
compatibility.  I don't have a good proposal right now on how to handle the 
compactor threads.  For now these features will only work if the user deploys 
the standalone metastore with the rest of the Hive jars.
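
To make the conf file item concrete, here is a minimal sketch of how a 
metastore-specific configuration could keep honoring Hive configuration values. 
It is not the actual design.  java.util.Properties stands in for Hadoop's 
Configuration class so the example runs without Hadoop on the classpath, and 
the class name, the "metastore.thrift.uris" key, and the default value are all 
hypothetical.  Only "hive.metastore.uris" is an existing HiveConf key.

```java
import java.util.Properties;

/** Hypothetical sketch: a metastore config value that still honors its legacy Hive key. */
final class MetastoreConfSketch {

    /** Each variable carries a new metastore key plus the Hive key it would replace. */
    enum ConfVars {
        // "metastore.thrift.uris" is a made-up name; "hive.metastore.uris" is the existing Hive key.
        THRIFT_URIS("metastore.thrift.uris", "hive.metastore.uris", "");

        final String metastoreKey;
        final String hiveKey;
        final String defaultValue;

        ConfVars(String metastoreKey, String hiveKey, String defaultValue) {
            this.metastoreKey = metastoreKey;
            this.hiveKey = hiveKey;
            this.defaultValue = defaultValue;
        }
    }

    /** Looks up the metastore key first, then the Hive key, then falls back to the default. */
    static String get(Properties conf, ConfVars var) {
        String value = conf.getProperty(var.metastoreKey);
        if (value == null) {
            value = conf.getProperty(var.hiveKey);
        }
        return value != null ? value : var.defaultValue;
    }
}
```

With this lookup order an existing hive-site style setting keeps working, and 
a value set under the new key takes precedence when both are present.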

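For the TypeInfo item, the three things the metastore needs (type names, type 
groupings, and supported casts when altering column types) could be captured 
in something as small as the sketch below.  The class and method names are 
hypothetical, and the cast rules shown (any change within the string group, 
widening only within the integral group) are illustrative assumptions, not 
Hive's actual compatibility rules.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for the small slice of TypeInfo the metastore uses. */
final class MetastoreTypeSketch {

    /** Coarse groupings, e.g. string, char and varchar are all string types. */
    enum Group { STRING, INTEGRAL, OTHER }

    private static final List<String> STRING_TYPES =
        Arrays.asList("string", "char", "varchar");
    // Ordered from narrowest to widest so list position can express widening.
    private static final List<String> INTEGRAL_TYPES =
        Arrays.asList("tinyint", "smallint", "int", "bigint");

    /** Lower-cases the name and strips parameters: "VARCHAR(10)" becomes "varchar". */
    static String baseName(String typeName) {
        String t = typeName.trim().toLowerCase(Locale.ROOT);
        int paren = t.indexOf('(');
        return paren < 0 ? t : t.substring(0, paren).trim();
    }

    static Group groupOf(String typeName) {
        String base = baseName(typeName);
        if (STRING_TYPES.contains(base)) return Group.STRING;
        if (INTEGRAL_TYPES.contains(base)) return Group.INTEGRAL;
        return Group.OTHER;
    }

    /** Illustrative rule set only; type parameters such as length are ignored. */
    static boolean canAlterColumnType(String from, String to) {
        String fromBase = baseName(from);
        String toBase = baseName(to);
        if (fromBase.equals(toBase)) return true;
        Group group = groupOf(from);
        if (group != groupOf(to)) return false;
        if (group == Group.STRING) return true;
        if (group == Group.INTEGRAL) {
            return INTEGRAL_TYPES.indexOf(fromBase) <= INTEGRAL_TYPES.indexOf(toBase);
        }
        return false;
    }
}
```

Because it works purely on type name strings, a class like this has no 
dependency on SerDe or ObjectInspector.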

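For tables with the schema in the file, the interface I have in mind might 
look roughly like the following.  Everything here is a hypothetical sketch: 
the interface name, its single method, the ColumnSketch holder, and the 
"example.schema" table parameter are invented for illustration.  The toy 
implementation parses the schema from a table parameter; Hive's implementation 
would instead delegate to its SerDe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical column holder; a real version would likely reuse the metastore's own column type. */
final class ColumnSketch {
    final String name;
    final String type;

    ColumnSketch(String name, String type) {
        this.name = name;
        this.type = type;
    }
}

/** Hypothetical interface a system could implement to read a table's schema from storage. */
interface SchemaReader {
    List<ColumnSketch> readSchema(String tableLocation, Map<String, String> tableParameters);
}

/** Toy implementation: reads "name:type,name:type" from a made-up table parameter. */
final class ParameterSchemaReader implements SchemaReader {
    static final String SCHEMA_KEY = "example.schema";

    @Override
    public List<ColumnSketch> readSchema(String tableLocation, Map<String, String> tableParameters) {
        List<ColumnSketch> columns = new ArrayList<>();
        String encoded = tableParameters.get(SCHEMA_KEY);
        if (encoded == null || encoded.isEmpty()) {
            return columns; // no schema recorded for this table
        }
        for (String field : encoded.split(",")) {
            String[] parts = field.split(":", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed field: " + field);
            }
            columns.add(new ColumnSketch(parts[0].trim(), parts[1].trim()));
        }
        return columns;
    }
}
```

The metastore would only ever see the interface, so the serde module stays 
out of its dependencies.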

> Make metastore a separately releasable module
> ---------------------------------------------
>
>                 Key: HIVE-17159
>                 URL: https://issues.apache.org/jira/browse/HIVE-17159
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>
> As proposed in this 
> [thread|https://lists.apache.org/thread.html/5e75f45d60f0b819510814a126cfd3809dd24b1c7035a1c8c41b0c5c@%3Cdev.hive.apache.org%3E]
>  on the dev list, we should move the metastore into a separately releasable 
> module.  This is a POC of and potential first step towards separating out the 
> metastore as a separate Apache TLP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
