Re: [DISCUSS] Separating out the metastore as its own TLP

Vihang Karajgaonkar Thu, 06 Jul 2017 11:05:05 -0700

I can understand the concerns from Edward and Xuefu, and I think they are
valid. I think a regular release cadence will help alleviate, to a certain
extent, the concerns about features making it into releases. Quarterly or
semi-annual releases would be a good thing in general for both Hive and
the Metastore (if we decide to separate it). It would help that the
metastore has the same PMC as Hive, since you would most likely have to
get reviews and approvals from the same set of people as you do now. For
features in dev branches, we will have to come up with a release strategy
so that features spanning the metastore and Hive work well while in
development. Maybe we could use snapshot libraries of the metastore, built
from the latest code, and make a metastore release before each Hive
release so that they are always in sync.


As for the concerns about features spanning multiple projects, that is
true even today (Hive, Spark, and Tez are all separate projects), although
I agree that it may be a lesser problem for most features. In my opinion,
while it is true that many other projects like Impala, Presto, and Spark
use the Hive Metastore, they do so primarily to maintain compatibility
with Hive. When their adoption rises, what stops them from creating their
own metadata service that suits their needs better? If or when that
happens, it would lead to fragmentation as far as metadata stores are
concerned. If we separate HMS, we can strive to make it a general purpose
metadata service that other projects would want to adopt, banking on the
advantages it brings now, like compatibility with Hive. I think as long
as the metastore is within Hive, it will always be Hive's metastore, and
other projects will be cautious about adopting it, fearing changes that
might break their code and always having to play well with Hive.

Thanks,
Vihang

On Thu, Jul 6, 2017 at 2:10 AM, Peter Vary <[email protected]> wrote:

> Hi folks,
>
> I agree with most of the things Edward said. I have faced similar issues
> on a smaller scale when integrating Hive with Yetus. We are forced to keep
> patched Yetus files in the Hive repo until they push their next release.
> I also followed one more serious problem, where a patch was committed to
> Hive, Impala and Spark, and just a few days before the release all of the
> changes were reverted from the projects due to concerns raised by the
> Spark committee (after the changes were already committed to Spark as
> well).
>
> Having said all of this, I still think that separating the HMS into a new
> top level project could be a step in the right direction with the
> following constraints. The new project should have:
> - Strict, stability oriented branching strategy following Edward's
> suggestions, so if a downstream project - for example Hive - needs some fix
> or easy change, it could be incorporated and released almost immediately.
> So we have to have these:
>         - Always releasable head
>         - Every multi commit feature should be added as a feature branch
> - Strict, enforced, stability oriented API strategy, so we will not be
> surprised by features added by other projects that break Hive compatibility.
> To avoid this situation we need to design for it, have pre-commit tests in
> place to catch inadvertent changes, and most importantly have a clear
> commitment to it.
>
> I think, since the current HMS is already used by numerous other projects,
> we should already have these in mind when modifying anything in HMS related
> code. This is not the main focus of Hive, so we do not concentrate on it,
> and there are often interoperability issues and problems. We could do this
> inside Hive as well, but the current approach followed by Hive and the one
> required by the HMS require different mindsets. We need a clear,
> well defined boundary, and separating the 2 projects could help with this.
> We can focus on the different needs and goals, and eventually we might have
> a different culture as well, one which suits the specific needs of the
> specific part of the code.
>
> I think by keeping these rules in the new top level HMS we can mitigate most
> of the issues mentioned below, and we will be better off overall.
> What do you think, Edward?
>
> Thanks,
> Peter
>
>
> > On Jul 5, 2017, at 10:16 PM, Xuefu Zhang <[email protected]> wrote:
> >
> > I think Edward's concern is valid. While I voiced my support for this
> > proposal, which was more from the benefits of the whole Hadoop
> > ecosystem, I don't see the equal benefits for Hive. Instead, it may even
> > create more overhead for Hive. I'd really like to take time to see what
> > are the road blocks for other projects to use HMS as it is. The issue of
> > Spark including a Hive fork, which was brought up some time back, is
> > certainly not one of them.
> >
> > Thanks,
> > Xuefu
> >
> > On Wed, Jul 5, 2017 at 12:33 PM, Edward Capriolo <[email protected]>
> > wrote:
> >
> >> On Wed, Jul 5, 2017 at 1:51 PM, Alan Gates <[email protected]> wrote:
> >>
> >>> On Mon, Jul 3, 2017 at 6:20 AM, Edward Capriolo <[email protected]>
> >>> wrote:
> >>>
> >>>>
> >>>> We already have things in the meta-store not directly tied to language
> >>>> features. For example hive metastore has a "retention" property which
> >>>> is not actively in use by anything. In reality, we rarely say 'no' or
> >>>> -1 to much. Which in part is why I believe our release process is
> >>>> grinding slower: we have so many things in flight I do not feel that
> >>>> any one person can keep track. You are working on porting the
> >>>> metastore to hbase.
> >>>> https://issues.apache.org/jira/browse/HIVE-9452 did you get a -1 or
> >>>> 'No' along the way? When I first noticed this I pointed out that
> >>>> someone has already ported the metastore to Cassandra
> >>>> https://github.com/riptano/brisk/blob/master/src/java/src/org/apache/cassandra/hadoop/hive/metastore/SchemaManagerService.java,
> >>>> but I was more exciting/rational for this multi-year approach using
> >>>> hbase so I let everyone 'have at it'.
> >>>>
> >>> Your example and mine are not equivalent.  The HBase metastore is still
> >>> a Hive feature, even if some thought it not worthwhile.  That is
> >>> different than people bringing features that will never interest Hive
> >>> or that Hive could never use (e.g. Dain’s desire for the metastore to
> >>> support Presto style views).
> >>>
> >>> I forgot to mention the issue these would-be non-Hive contributors have
> >>> with releases if they contribute their features to the metastore while
> >>> it’s inside Hive.  Is Hive going to do a release just to push out
> >>> features in the metastore that it doesn’t care about?
> >>>
> >>> You seem to be asserting that doing this doesn’t really help non-Hive
> >>> based systems that are using or would like to use the metastore.  But
> >>> it is interesting that people from three of those systems have
> >>> commented in the thread so far, and all are positive (Dmitrias from
> >>> Impala, Dain from Presto, and Sriharsha from the schema registry
> >>> project).
> >>>
> >>>
> >>>> I am going to give a hypothetical but real world situation. Suppose I
> >>>> want to add the statement "CREATE permanent macro xyz", this feature I
> >>>> believe would cross cut calcite, hive, and hive metastore. To build
> >>>> this feature I would need to orchestrate the change across 3 separate
> >>>> groups of hive 'subcommittees' for lack of a better word. 3 git repos,
> >>>> 3 Jira's 3 releases. That is not counting if we run into some bug or
> >>>> misfeature (maybe with Tez or something else) so that brings in 4-5
> >>>> releases of upstream to add a feature to hive. This does not take into
> >>>> account normal processes mess ups. For example say you get the
> >>>> metastore done, but now the people doing the calcite/antlr suggest the
> >>>> feature have different syntax because they did not read the 3-4 linked
> >>>> tickets when the process started? Now, you have to loop back around
> >>>> the process. Finding 1 person in 1 project to usher along the feature
> >>>> you want is difficult, having to find and clear time with 3 people
> >>>> across three projects is going to be a difficult along with then
> >>>> 'pushing' them all to kick out a release so you can finally use said
> >>>> feature.
> >>>>
> >>>
> >>> I partially agree with you.  On the reviews, JIRAs, etc. I don’t think
> >>> it adds much, if any, overhead.  Hive is a big project and no one
> >>> person knows all the code anymore.  If you wanted to add a permanent
> >>> macros feature you would need reviews from someone who knows the parser
> >>> (probably Pengcheng), people who know the optimizer (Jesus, Ashutosh,
> >>> …), and someone who knows the metastore (me, Thejas, …).  And any large
> >>> feature is going to be implemented over multiple JIRAs, all of which
> >>> are linkable regardless of whether the JIRAs start with METASTORE- or
> >>> HIVE-.   I also don’t think it makes the feature disagreement any
> >>> worse.  If the optimizer team absolutely insists it has to have some
> >>> feature and the metastore team insists that it can’t have that feature
> >>> you’re going to have to work through the issue whether they all are in
> >>> Hive or in two separate projects.
> >>>
> >>> Where I agree the split adds cost is releases.  Before your macro
> >>> feature could go live you need releases from each of the components.
> >>> And while in development the components need to use snapshot versions
> >>> of the other components.  My assertion is that the benefits outweigh
> >>> this cost.
> >>>
> >>> Alan.
> >>>
> >>
> >>
> >> "You seem to be asserting that doing this doesn’t really help non-Hive
> >> based systems that are using or would like to use the metastore.  But it
> >> is interesting that people from three of those systems have commented in
> >> the thread so far, and all are positive (Dmitrias from Impala, Dain from
> >> Presto, and Sriharsha from the schema registry project)."
> >>
> >> I notice that impala has a syntax for caching.
> >>
> >> https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html
> >>
> >> Notice how the cache syntax did not make its way into Hive? It would make
> >> sense if this feature trickled its way into hive and used HDFS caching,
> >> for example. I have heard many people claim that using hive metastore is
> >> such a because it is packaged weird (like with ORC), but again besides
> >> claim/complaining no one has stepped up to deal with that.
> >>
> >> What I would suggest is going forward for maybe a trial period of 6
> >> months, labeling JIRA tickets with a tag that would be
> >> "SeeThisProvesWeNeedATLPMetastore". Because right now I do not see enough
> >> active use cases of people giving anything back to justify hurting our
> >> workflow so much.
> >>
> >>
> >> "I partially agree with you.  On the reviews, JIRAs, etc. I don’t think
> >> it adds much, if any, overhead.  Hive is a big project and no one person
> >> knows all the code anymore.  If you wanted to add a permanent macros
> >> feature you would need reviews from someone who knows the parser
> >> (probably Pengcheng), people who know the optimizer (Jesus, Ashutosh,
> >> …), and someone who knows the metastore (me, Thejas, …).  And any large
> >> feature is going to be implemented over multiple JIRAs, all of which are
> >> linkable regardless of whether the JIRAs start with METASTORE- or
> >> HIVE-.   I also don’t think it makes the feature disagreement any worse.
> >> If the optimizer team absolutely insists it has to have some feature and
> >> the metastore team insists that it can’t have that feature you’re going
> >> to have to work through the issue whether they all are in Hive or in two
> >> separate projects"
> >>
> >> Macro was done in 1 patch and reviewed by 2 people. With 2-3 follow on
> >> bugs.
> >>
> >> https://issues.apache.org/jira/browse/HIVE-2655
> >>
> >> I think your perception is different than mine because of circumstances.
> >> I have waited weeks/months for reviews/merges (in Hive and other apache
> >> projects) from mundane udfs to cassandra-storage-handlers. You obviously
> >> work in a large company and you can more easily align objectives, go to
> >> the water cooler and say "hey bob you know it would be cool if you can
> >> release x so I can do y". When you are not in that situation it's like,
> >> "hey mailing list, my patch was done for three months now and like I have
> >> had to rebase it three times and like I notice like other stuff is
> >> getting committed."
> >>
> >> If you look at it tactically, "create permanent macro xzy". I go over to
> >> calcite and suggest some changes there, if this concept is not "game
> >> changer" it is probably going to sit unreviewed. If it is "game changer"
> >> exciting that is 72 hours for release voting. Next go to hive-metastore
> >> repeat the process, but remember now I have to "wow" the metastore people
> >> with the "game changer" and if that crew is super focused on something
> >> about kafka well now Hive features are second fiddle. Now let's say a
> >> hive release is coming up, and I really want my feature in it.
> >> hive-metastore-tlp might currently have a broken trunk because mongo
> >> wants to add spaceships to wombats feature has a bug and frankly that
> >> should not affect us.
> >>
> >> I hate to draw in something else but I feel it is related:
> >>
> >> 8 December 2016 : release 2.1.1 available
> >> 07 April 2017 : release 1.2.2 available
> >> hive-dev [DISCUSS] Supporting Hadoop-1 and experimental features
> >> hive-dev Re: release chaos?
> >>
> >> I have been vocal about not liking certain branching strategies and
> >> proposals that take us away from releasable trunk. We have steadily
> >> headed in a direction where we are pulling things out of hive, and we
> >> are not able to turn out releases. We even had a thread "release chaos"
> >> talking about our 5 active branches (with friends I say "jumped the
> >> shark"). Pulling out the metastore is only going to make this worse. I
> >> do not even see the model as successful. You may say it is great that
> >> calcite lets people share our sql dialect or the ORC TLP has 5
> >> committers, but if Hive can not get a release out the door I do not see
> >> us optimizing for the proper thing.
> >>
>
>
