Re: [DISCUSS] Separating out the metastore as its own TLP

Alan Gates Sun, 02 Jul 2017 19:16:27 -0700

Comments inlined.

On Sun, Jul 2, 2017 at 3:22 PM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:


> I am not sure I am on the fence with this.
>
> I am -1, and I offer this -1 with the hope of being convinced otherwise
>
Thank you for being open to reconsider.

>
>
> "By making it a separate project we will enable other projects to join us
> in innovating on the metastore."
>
> The relevant questions I have are,
>
> "What is stopping others from joining us now?"
> "What does being a TLP do for us that we do not have now?"
>

Walking through a use case will help answer these.  This is a real world
situation, not a hypothetical.  I’ve been talking with a team building a
schema registry for Kafka[1].  I’d like them to use the Hive metastore
rather than reinvent the wheel.  I believe this would be good for users
(all their tools can work together on a shared understanding of the data)
and admins (just one metadata store to administer) and for the ecosystem
(tools can work across stored data and streaming data).

This system has some requirements on metadata that Hive does not.  To take
one example, it would like a schema to be a top level concept instead of a
concept tied to tables or partitions.  This is not a problem for Hive, but
neither is it interesting.  So if they come with patches for this, would we
accept them?  As the Hive PMC our answer will be no, because it doesn’t
help Hive’s metadata.  Even if we accept their patches, will we make them
committers when we know they don’t care about Hive as Hive, but only the
metastore?  Again, the right answer for the Hive PMC is no.

And we cannot say that Hive should support a generic metadata system within
itself.  That turns Hive into an umbrella project, which Apache has
repeatedly worked to avoid.  So Hive will either need to reject non-Hive
centric features and contributors or end up in a place Apache has worked to
avoid.

And finally, why would other teams want to mess with all of Hive when they
only want the metastore?  Hive is a large and complex system.  If we break
the metastore out it is much more approachable by non-Hive contributors.

Obviously the Hive team doesn’t want to see their metastore turn into
something unusable by Hive, which is why we were specific in saying we
wanted it to continue to support high performance SQL systems.

My experience in watching ORC move out of Hive is that the adoption has
increased significantly.  It is reasonable to assume that moving the
metastore out will also increase adoption and make it easier for others to
get involved.


> I see a lot of downsides:
> 1) We have to maintain two sites
> 2) we have to maintain two committer lists
>
> A large problem I see is this: Hive is already being pulled in too many
> different directions. There is some grumbling about the state of
> hive-on-spark.
>

I believe this argues in favor of the split, not against.  By pulling out
the metastore we are relieving pressure on Hive itself.  Let Hive focus on
being a SQL engine.  Let another team focus on runtime metadata.

On your committer questions in later emails, the point of going to a TLP
has nothing to do with adding new committers.  Traditionally new projects
start in the incubator.  But given that all of the PMC of this new project
are already experienced Hive PMC members I see no reason to go through
incubator.  I agree with you that we would not throw any new people into
the mix.  People join the project in the same way as always, by
contributing.

Alan.

1. https://github.com/hortonworks/registry


> Most importantly, our release process seems 'injured' by too many branches
> going off in different ways. If the metastore lives outside of Hive we are
> going to compound this issue. I would strongly suggest we do not undertake
> this until we can at least turn out 2 usable releases in a 6 month period.
>
