[
https://issues.apache.org/jira/browse/HIVE-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100299#comment-16100299
]
Alan Gates commented on HIVE-17159:
-----------------------------------
Thoughts on how to proceed:
We should do this in master, not in a branch. This avoids having to port
patches when people make changes. It does expose everyone to the moving parts,
but part of what we want to prove out is whether or not this is painful to
other developers, so this is actually a feature.
Not everything in the current metastore module can be released separately as
is. There are various reasons for this, detailed below. But I believe this
means the easiest way to proceed is to create a new maven module
"metastore-standalone" or something, and move the files we want into that.
Putting the files in a new maven module also allows us to easily confirm the
separate metastore's dependencies.
Despite physically moving the files, we should not change the package names now.
Though we will probably want to change the package names if this becomes its
own TLP, doing so now would be disruptive with no benefit.
The goal is to not have Hive dependencies in the separate metastore, with the
exception of the storage-api. I propose keeping the storage-api because it has
the SARGs in it and I believe these will be useful in the future as a way to
describe partition filter expressions. This does mean that some code has to be
copied or detangled.
In looking through this and doing some POC work, I found the following areas
that need to be copied, detangled, or left behind:
* The conf file. A similar class to HiveConf can be built that has the
relevant ConfVars. This class will need to extend Hadoop's Configuration class
since that is a common base class with HiveConf and HiveConf cannot be directly
subclassed. It will also need to continue to support Hive configuration values.
* I propose we leave the HBase metastore in the current metastore module, or
remove it altogether. As far as I know no one is using it and no active work
is going on in that space.
* I propose we leave behind the FileMetadataHandler. Currently this is only
implemented in the HBase metastore. This means we would need to modify
PartitionExpressionProxy to split out the calls that deal with file metadata
into a sub-interface.
* The shims. If we assume this will never be used against Hadoop 1 then we can
drop most of the shims. Mostly the metastore makes HDFS calls and uses the
thrift security classes. Neither of these changes, as far as I could tell,
between Hadoop 2 and 3. This means that rather than copy the shims we can
remove them and make direct calls to the Hadoop methods. We will need to copy
the HadoopThriftAuthBridge code, but only the Hadoop 2 version of it.
* TypeInfo. This is in the serde module. It cannot be easily moved out into
storage-api because there is a circular dependency between TypeInfo, SerDe, and
ObjectInspector. The metastore only uses three things from TypeInfo: 1) the
type names; 2) type groupings (e.g. string, char, varchar are all string
types); and 3) supported casts when altering column types. By copying just
this logic into the metastore I was able to remove the dependency on TypeInfo.
Ideally we will find a place where these type definitions can live and be
shared by the metastore, Hive, file formats like ORC and Parquet, and any other
engines that wish to use them.
* Metrics. Assuming we choose not to support the legacy Hive metrics from Hive
1 we can use codahale metrics directly and not copy over Hive's metrics code.
This includes copying the JvmPauseMonitor.
* Handling tables with the schema in the file. Currently this is done using a
SerDe. Given that we want this to be independent of the serde module that
won't work. I propose creating an interface that systems can implement to read
schemas from storage. Hive can continue to use its SerDe to implement this
interface.
* There are a few features that today work by calling back into ql, e.g. the
PartitionExpressionProxy and the compactor threads. Neither of these appears
easy to resolve. PartitionExpressionProxy works directly on the Hive AST.
This should be converted to use SARGs. I haven't scoped how much work this is.
We will still need to support the old AST methods as well for backwards
compatibility. I don't have a good proposal right now on how to handle the
compactor threads. For now these features will only work if the user deploys
the standalone metastore with the rest of the Hive jars.
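To illustrate the TypeInfo item above, here is a minimal Java sketch of the three things the metastore uses: type names, type groupings, and a check for supported casts when altering a column type. The class name MetastoreColumnTypes, its method names, and the integral widening order are assumptions made for illustration only. They are not existing Hive classes, nor Hive's actual cast rules. Only the string/char/varchar grouping comes from the text above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not an existing Hive class. It depends only on the JDK,
// which is the point: no TypeInfo, SerDe, or ObjectInspector is needed.
final class MetastoreColumnTypes {
    // 1) Type names, kept as plain strings.
    static final String STRING = "string";
    static final String CHAR = "char";
    static final String VARCHAR = "varchar";
    static final String TINYINT = "tinyint";
    static final String SMALLINT = "smallint";
    static final String INT = "int";
    static final String BIGINT = "bigint";

    // 2) Type groupings: string, char, and varchar are all string types.
    static final Set<String> STRING_TYPES = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(STRING, CHAR, VARCHAR)));

    // Assumed widening order for integral types (illustrative, narrowest first).
    static final List<String> INTEGRAL_TYPES =
            Collections.unmodifiableList(Arrays.asList(TINYINT, SMALLINT, INT, BIGINT));

    private MetastoreColumnTypes() {
    }

    // Strips type parameters and normalizes case: "VARCHAR(10)" becomes "varchar".
    static String baseName(String typeName) {
        String t = typeName.trim().toLowerCase(Locale.ROOT);
        int paren = t.indexOf('(');
        return paren < 0 ? t : t.substring(0, paren).trim();
    }

    // 3) Supported casts when altering a column type. Type parameters such as
    // the varchar length are ignored in this sketch.
    static boolean canAlterColumnType(String from, String to) {
        String f = baseName(from);
        String t = baseName(to);
        if (f.equals(t)) {
            return true;
        }
        if (STRING_TYPES.contains(f) && STRING_TYPES.contains(t)) {
            return true;
        }
        int fromRank = INTEGRAL_TYPES.indexOf(f);
        int toRank = INTEGRAL_TYPES.indexOf(t);
        // Only widening within the integral group is allowed here.
        return fromRank >= 0 && toRank >= 0 && fromRank < toRank;
    }
}
```

With this sketch, canAlterColumnType("int", "bigint") and canAlterColumnType("char(5)", "string") are accepted, while canAlterColumnType("bigint", "int") is rejected. A shared home for these definitions, as suggested above, would replace a copy like this one.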
> Make metastore a separately releasable module
> ---------------------------------------------
>
> Key: HIVE-17159
> URL: https://issues.apache.org/jira/browse/HIVE-17159
> Project: Hive
> Issue Type: New Feature
> Components: Metastore
> Reporter: Alan Gates
> Assignee: Alan Gates
>
> As proposed in this
> [thread|https://lists.apache.org/thread.html/5e75f45d60f0b819510814a126cfd3809dd24b1c7035a1c8c41b0c5c@%3Cdev.hive.apache.org%3E]
> on the dev list, we should move the metastore into a separately releasable
> module. This is a POC of and potential first step towards separating out the
> metastore as a separate Apache TLP.