Re: Start releasing the master branch

Stamatis Zampetakis Wed, 02 Mar 2022 00:52:14 -0800

Agree with Peter, creating JIRAs is the way to go.

Putting the appropriate priority (e.g., BLOCKER) and version (4.0.0 or
4.0.0-alpha-1) when creating the JIRA should be enough to keep us on track.
I am mentioning both 4.0.0 and 4.0.0-alpha-1 because eventually I think we
are gonna move everything with target 4.0.0 to 4.0.0-alpha-1.
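
A minimal sketch of how such a filter link could be generated for either
version, assuming the "Target Version/s" field name and the base URL from
Peter's link below. The optional priority clause, the "Blocker" value, and the
helper names build_filter_url and jql_of are illustrative assumptions, not
something already used in this thread:

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Base search URL taken from Peter's link in this thread.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/"


def build_filter_url(target_version, priority=None):
    """Return a JIRA search URL for HIVE issues with the given target version.

    The priority clause is an illustrative assumption, not part of Peter's filter.
    """
    clauses = ['project = "HIVE"', f'"Target Version/s" = "{target_version}"']
    if priority is not None:
        clauses.append(f'priority = "{priority}"')
    jql = " AND ".join(clauses)
    # quote_via=quote encodes spaces as %20 (not '+'), like the original link.
    return JIRA_SEARCH + "?" + urlencode({"jql": jql}, quote_via=quote)


def jql_of(url):
    """Decode the JQL text back out of a URL built above."""
    return parse_qs(urlsplit(url).query)["jql"][0]


if __name__ == "__main__":
    # Print one link per version mentioned above.
    for version in ("4.0.0", "4.0.0-alpha-1"):
        print(build_filter_url(version, priority="Blocker"))
```

Running it prints one link per version, which makes it easy to see what is
still targeted at 4.0.0 before moving it to 4.0.0-alpha-1.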


On Wed, Mar 2, 2022 at 9:37 AM Peter Vary <[email protected]>
wrote:

> Hi Team,
>
> Could we create tickets for the issues?
> I think it would be good to collect the issues/potential blockers in the
> jira instead of having a complicated mail thread.
>
> If we set the target version to 4.0.0-alpha-1, then we can easily use the
> following filter to see the status of the tasks:
>
> https://issues.apache.org/jira/issues/?jql=project%3D%22HIVE%22%20AND%20%22Target%20Version%2Fs%22%3D%224.0.0-alpha-1%22
>
> @Stamatis: Sadly I have missed your letter/jira and created my own with
> the fix for building from the src package:
> https://issues.apache.org/jira/browse/HIVE-25997
> If you have time, I would like to ask you to review.
>
> If anyone knows of any blocker I would like to ask them to create a jira
> for that too.
>
> Thanks,
> Peter
>
>
> > On 2022. Mar 2., at 7:04, Sungwoo Park <[email protected]> wrote:
> >
> > Hello Alessandro,
> >
> > For the latest commit, loading ORC tables fails (with the log message
> > shown below). Let me try to find the commit that introduced this bug and
> > create a JIRA ticket.
> >
> > --- Sungwoo
> >
> > 2022-03-02 05:41:56,578 ERROR [Thread-73] exec.StatsTask: Failed to run stats task
> > java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
> >  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:622)
> >  at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.constructColumnStatsFromPackedRows(ColStatsProcessor.java:105)
> >  at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.persistColumnStats(ColStatsProcessor.java:200)
> >  at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.process(ColStatsProcessor.java:93)
> >  at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
> >  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
> >  at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
> >  at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:83)
> > Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
> >  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
> >  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
> >  at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
> >  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
> >  at org.apache.hadoop.hive.ql.exec.FetchOperator.generateWrappedSplits(FetchOperator.java:435)
> >  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:402)
> >  at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:306)
> >  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560)
> >  ... 7 more
> >
> > On Tue, 1 Mar 2022, Alessandro Solimando wrote:
> >
> >> Hi Sungwoo,
> >> last time I tried to run a TPC-DS-based benchmark I stumbled upon a similar
> >> situation. In the end I found that statistics were not computed, so CBO was
> >> not kicking in, and the automatic retry ran with CBO off, which was failing
> >> for something like 10 queries (subqueries cannot be decorrelated, but also
> >> some runtime errors).
> >>
> >> Making sure that (column) statistics were correctly computed fixed the
> >> problem.
> >>
> >> Can you check if this is the case for you?
> >>
> >> HTH,
> >> Alessandro
> >>
> >> On Tue, 1 Mar 2022 at 15:28, POSTECH CT <[email protected]> wrote:
> >>
> >>> Hello Hive team,
> >>>
> >>> I wonder if anyone in the Hive team has tried the TPC-DS benchmark on
> >>> the master branch recently.  We occasionally run TPC-DS system tests
> >>> using the master branch, and the tests don't succeed completely. Here
> >>> is how our TPC-DS tests proceed.
> >>>
> >>> 1. Compile and run Hive on Tez (not Hive-LLAP)
> >>> 2. Load ORC tables from 1TB TPC-DS raw text data, and compute statistics
> >>> 3. Run 99 TPC-DS queries which were slightly modified to return
> >>> a varying number of rows (rather than 100 rows)
> >>> 4. Compare the results against the previous results
> >>>
> >>> The previous results were obtained and cross-checked by running Hive
> >>> 3.1.2 and SparkSQL 2.3/3.2, so we are fairly confident about their
> >>> correctness.
> >>>
> >>> For the latest commit in the master branch, step 2 fails. For earlier
> >>> commits (for example, commits in February 2021), step 3 fails where
> >>> several queries either fail or return wrong results.
> >>>
> >>> We can compile and report the test results in this mailing list, but
> >>> would like to know if similar results have been reproduced by the Hive
> >>> team, in order to make sure that we did not make errors in our tests.
> >>>
> >>> If it is okay to open a JIRA ticket that only reports failures in the
> >>> TPC-DS test, we could also perform git bisect to locate the commit
> >>> that begins to generate wrong results.
> >>>
> >>> --- Sungwoo Park
> >>>
> >>> On Tue, 1 Mar 2022, Zoltan Haindrich wrote:
> >>>
> >>>> Hey,
> >>>>
> >>>> Great to hear that we are on the same side regarding these things :)
> >>>>
> >>>> For around a week now - we have nightly builds for the master branch:
> >>>> http://ci.hive.apache.org/job/hive-nightly/12/
> >>>>
> >>>> I think we have 1 blocker issue:
> >>>> https://issues.apache.org/jira/browse/HIVE-25665
> >>>>
> >>>> I know about one more thing I would rather get fixed before we release
> >>> it:
> >>>> https://issues.apache.org/jira/browse/HIVE-25994
> >>>> The best would be to introduce smoke tests (HIVE-22302) to ensure that
> >>>> something like this will not happen in the future - but we should
> >>> probably
> >>>> start moving forward.
> >>>>
> >>>> I think we could call the first iteration of this as "4.0.0-alpha-1"
> :)
> >>>>
> >>>> I've added 4.0.0-alpha-1 as a version - and added the above two tickets
> >>>> to it.
> >>>>
> >>>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20fixVersion%20%3D%204.0.0-alpha-1
> >>>>
> >>>> Are there any more things you guys know which would be needed?
> >>>>
> >>>> cheers,
> >>>> Zoltan
> >>>>
> >>>>
> >>>> On 2/22/22 12:18 PM, Peter Vary wrote:
> >>>>> I would vote for 4.0.0-alpha-1 or similar for all of the components.
> >>>>>
> >>>>> When we have more stable releases I would keep the 4.x.x schema,
> since
> >>>>> everyone is familiar with it, and I do not see a really good reason
> to
> >>>>> change it.
> >>>>>
> >>>>> Thanks,
> >>>>> Peter
> >>>>>
> >>>>>
> >>>>>> On 2022. Feb 10., at 3:34, Szehon Ho <[email protected]>
> wrote:
> >>>>>>
> >>>>>> +1 that would be awesome to see Hive master released after so long.
> >>>>>>
> >>>>>> Either 4.0 or 4.0.0-alpha-1 makes sense to me, not sure how we would
> >>> pick
> >>>>>> any 3.x or calendar date (which could tend to slip and be more
> >>>>>> confusing?).
> >>>>>>
> >>>>>> Thanks in any case to get the ball rolling.
> >>>>>> Szehon
> >>>>>>
> >>>>>> On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich <[email protected]>
> wrote:
> >>>>>>
> >>>>>>> Hey,
> >>>>>>>
> >>>>>>> Thank you guys for chiming in; versioning is for sure something we
> >>> should
> >>>>>>> get to some common ground.
> >>>>>>> It's a triple problem right now; I think we have the following
> things:
> >>>>>>> * storage-api
> >>>>>>> ** we have "2.7.3-SNAPSHOT" in the repo
> >>>>>>> ***
> >>>>>>>
> >>>
> https://github.com/apache/hive/blob/0d1cffffc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
> >>>>>>> ** meanwhile we already have 2.8.1 released to maven central
> >>>>>>> ***
> >>> https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
> >>>>>>> * standalone-metastore
> >>>>>>> ** 4.0.0-SNAPSHOT in the repo
> >>>>>>> ** last release is 3.1.2
> >>>>>>> * hive
> >>>>>>> ** 4.0.0-SNAPSHOT in the repo
> >>>>>>> ** last release is 3.1.2
> >>>>>>>
> >>>>>>> Regarding the actual version number I'm not entirely sure where we
> >>> should
> >>>>>>> start the numbering - that's why I was referring to it as Hive-X
> in my
> >>>>>>> first letter.
> >>>>>>>
> >>>>>>> I think the key point here would be to start shipping releases
> >>> regularly
> >>>>>>> and not the actual version number we will use - I'm kinda open to
> any
> >>>>>>> versioning scheme which
> >>>>>>> reflects that this is a newer release than 3.1.2.
> >>>>>>>
> >>>>>>> I could imagine the following ones:
> >>>>>>> (A) start with something less expected; but keep 3 in the prefix to
> >>>>>>> reflect that this is not yet 4.0
> >>>>>>>     I can imagine the following numbers:
> >>>>>>>     3.900.0, 3.901.0, ...
> >>>>>>>     3.9.0, 3.9.1, ...
> >>>>>>> (B) start 4.0.0
> >>>>>>>     4.0.0, 4.1.0, ...
> >>>>>>> (C) jump to some calendar based version number like 2022.2.9
> >>>>>>>     trunk based development has pros and cons...making a move like
> >>> this
> >>>>>>> irreversibly pledges trunk based development; and makes release
> >>> branches
> >>>>>>> hard to introduce
> >>>>>>> (X) somewhat orthogonal is to (also) use some suffixes
> >>>>>>>     4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
> >>>>>>>     this is probably the most tempting to use - but this versioning
> >>>>>>> schema with a non-changing MINOR and PATCH number will
> >>>>>>>     also suggest that the actual software is fully compatible - and
> >>> only
> >>>>>>> bugs are being fixed - which will not be true...
> >>>>>>>
> >>>>>>> I really like the idea to suffix these releases with alpha or beta
> -
> >>>>>>> which
> >>>>>>> will communicate our level of commitment that these are not 100%
> >>> production
> >>>>>>> ready artifacts.
> >>>>>>>
> >>>>>>> I think we could fix HIVE-25665; and probably experiment with
> >>>>>>> 4.0.0-alpha1
> >>>>>>> for start...
> >>>>>>>
> >>>>>>>> This also means there should *not* be a branch-4 after releasing
> Hive
> >>>>>>> 4.0
> >>>>>>>> and let that diverge (and becomes the next, super-ignored
> branch-3),
> >>>>>>> correct; no need to keep a branch we don't maintain...but in any
> case
> >>> I
> >>>>>>> think we can postpone this decision until there will be something
> to
> >>>>>>> release... :)
> >>>>>>>
> >>>>>>> cheers,
> >>>>>>> Zoltan
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2/9/22 10:23 AM, László Bodor wrote:
> >>>>>>>> Hi All!
> >>>>>>>>
> >>>>>>>> A purely technical question: what will the SNAPSHOT version become
> >>> after
> >>>>>>>> releasing Hive 4.0.0? I think this is important, as it defines and
> >>>>>>> reflects
> >>>>>>>> the future release plans.
> >>>>>>>>
> >>>>>>>> Currently, it's 4.0.0-SNAPSHOT, I guess it's since Hive 3.0 +
> >>> branch-3.
> >>>>>>>> Hive is an evolving and super-active project: if we want to make
> >>> regular
> >>>>>>>> releases, we should simply release Hive 4.0 and bump pom to
> >>>>>>> 4.1.0-SNAPSHOT,
> >>>>>>>> which clearly says that we can release Hive 4.1 anytime we want,
> >>> without
> >>>>>>>> being frustrated about "whether we included enough cool stuff to
> >>> release
> >>>>>>>> 5.0".
> >>>>>>>>
> >>>>>>>> This also means there should *not* be a branch-4 after releasing
> >>> Hive
> >>>>>>>> 4.0
> >>>>>>>> and let that diverge (and becomes the next, super-ignored
> branch-3),
> >>>>>>>> only
> >>>>>>>> when we end up bringing a minor backward-incompatible thing that
> >>> needs a
> >>>>>>>> 4.0.x, and when it happens, we'll create *branch-4.0 *on demand.
> For
> >>> me,
> >>>>>>> a
> >>>>>>>> branch called *branch-4.0* doesn't imply either I can expect cool
> >>>>>>> releases
> >>>>>>>> in the future from there or the branch is maintained and tries to
> be
> >>> in
> >>>>>>>> sync with the *master*.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Laszlo Bodor
> >>>>>>>>
> >>>>>>>> Alessandro Solimando <[email protected]> wrote (on Tue,
> >>>>>>>> Feb 8, 2022, 16:42):
> >>>>>>>>
> >>>>>>>>> Hello everyone,
> >>>>>>>>> thank you for starting this discussion.
> >>>>>>>>>
> >>>>>>>>> I agree that releasing the master branch regularly and
> sufficiently
> >>>>>>> often
> >>>>>>>>> is welcome and vital for the health of the community.
> >>>>>>>>>
> >>>>>>>>> It would be great to hear from others too, especially PMC members
> >>> and
> >>>>>>>>> committers, but even simple contributors/followers like myself.
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>> Alessandro
> >>>>>>>>>
> >>>>>>>>> On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis <
> [email protected]
> >>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for starting the discussion Zoltan.
> >>>>>>>>>>
> >>>>>>>>>> I strongly believe that it is important to have regular and
> >>>>>>>>>> frequent releases,
> >>>>>>>>>> otherwise people will create and maintain separate Hive forks.
> >>>>>>>>>> The latter is not good for the project and the community may
> lose
> >>>>>>>>> valuable
> >>>>>>>>>> members because of it.
> >>>>>>>>>>
> >>>>>>>>>> Going forward I fully agree that there is no point bringing up
> >>> strong
> >>>>>>>>>> blockers for the next release. For sure there are many backward
> >>>>>>>>>> incompatible changes and possibly unstable features but unless
> we
> >>> get
> >>>>>>>>>> a
> >>>>>>>>>> release out it will be difficult to determine what is broken and
> >>> what
> >>>>>>>>> needs
> >>>>>>>>>> to be fixed.
> >>>>>>>>>>
> >>>>>>>>>> Due to the big number of changes that are going to appear in the
> >>> next
> >>>>>>>>>> version I would suggest using the terms Hive X-alpha, Hive
> X-beta
> >>> for
> >>>>>>> the
> >>>>>>>>>> first few releases. This will make it clear to the end users
> that
> >>> they
> >>>>>>>>> need
> >>>>>>>>>> to be careful when upgrading from an older version and it will
> >>> give us
> >>>>>>> a
> >>>>>>>>>> bit more time and freedom to treat issues that the users will
> >>> likely
> >>>>>>>>>> discover.
> >>>>>>>>>>
> >>>>>>>>>> The only real blocker that we may want to treat is HIVE-25665
> [1]
> >>> but
> >>>>>>> we
> >>>>>>>>>> can continue the discussion under that ticket and re-evaluate if
> >>>>>>>>> necessary.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Stamatis
> >>>>>>>>>>
> >>>>>>>>>> [1] https://issues.apache.org/jira/browse/HIVE-25665
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich <[email protected]>
> >>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey All,
> >>>>>>>>>>>
> >>>>>>>>>>> We haven't made a release for a long time now (3.1.2 was released on
> >>>>>>>>>>> 26 August 2019) - and I think because we didn't make that many
> >>>>>>>>>>> branch-3 releases, not too many fixes were ported there - which made
> >>>>>>>>>>> that release branch kinda erode away.
> >>>>>>>>>>>
> >>>>>>>>>>> We have a lot of new features/changes in the current master.
> >>>>>>>>>>> I think instead of aiming for big feature-packed releases we
> >>> should
> >>>>>>> aim
> >>>>>>>>>>> for making a regular release every few months - we should make
> >>>>>>>>>>> regular
> >>>>>>>>>>> releases which people could
> >>>>>>>>>>> install and use.
> >>>>>>>>>>> After all, releasing Hive after more than 2 years would be a big step
> >>>>>>>>>>> forward in itself - we have so many improvements that I can't even
> >>>>>>>>>>> count...
> >>>>>>>>>>>
> >>>>>>>>>>> But I may not know every aspect of the project / the state of some
> >>>>>>>>>>> internal features - so I would like to ask you:
> >>>>>>>>>>> What would be the bare minimum requirements before we could
> >>> release
> >>>>>>> the
> >>>>>>>>>>> current master as Hive X?
> >>>>>>>>>>>
> >>>>>>>>>>> There are many nice-to-have-s like:
> >>>>>>>>>>> * hadoop upgrade
> >>>>>>>>>>> * jdk11
> >>>>>>>>>>> * remove HoS or MR
> >>>>>>>>>>> * ?
> >>>>>>>>>>> but I don't think these are blockers...we can make any of these
> >>> in
> >>>>>>>>>>> the
> >>>>>>>>>>> next release if we start making them...
> >>>>>>>>>>>
> >>>>>>>>>>> cheers,
> >>>>>>>>>>> Zoltan
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
