Re: Start releasing the master branch

Peter Vary Wed, 02 Mar 2022 00:44:50 -0800

Hi Team,

Could we create tickets for the issues?
I think it would be good to collect the issues/potential blockers in Jira 
instead of having a complicated mail thread.


If we set the target version to 4.0.0-alpha-1, then we can easily use the 
following filter to see the status of the tasks:
https://issues.apache.org/jira/issues/?jql=project%3D%22HIVE%22%20AND%20%22Target%20Version%2Fs%22%3D%224.0.0-alpha-1%22
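
For anyone who wants the same kind of link for a different target version, here is a minimal Python sketch. It only percent-encodes the JQL used in the filter above; the helper name target_version_filter and the JIRA_SEARCH constant are made up for illustration and are not part of any Jira tooling.

```python
from urllib.parse import quote

# Base search URL taken from the filter link above.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/"


def target_version_filter(project: str, version: str) -> str:
    """Build a Jira search link for tickets with the given target version.

    Hypothetical helper: it assembles the same JQL as the link above and
    percent-encodes every reserved character (safe='' also encodes '/').
    """
    jql = f'project="{project}" AND "Target Version/s"="{version}"'
    return f"{JIRA_SEARCH}?jql={quote(jql, safe='')}"


if __name__ == "__main__":
    print(target_version_filter("HIVE", "4.0.0-alpha-1"))
```

Calling it with "HIVE" and "4.0.0-alpha-1" should give back the filter link above; a later version string can be passed in the same way once it exists in Jira.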

@Stamatis: Sadly, I missed your email/Jira ticket and created my own with the 
fix for building from the src package: 
https://issues.apache.org/jira/browse/HIVE-25997
If you have time, I would like to ask you to review it.

If anyone knows of any blocker, I would like to ask them to create a Jira 
ticket for that too.

Thanks,
Peter


> On 2022. Mar 2., at 7:04, Sungwoo Park <c...@pl.postech.ac.kr> wrote:
> 
> Hello Alessandro,
> 
> For the latest commit, loading ORC tables fails (with the log message shown 
> below). Let me try to find a commit that introduces this bug and create a 
> JIRA ticket.
> 
> --- Sungwoo
> 
> 2022-03-02 05:41:56,578 ERROR [Thread-73] exec.StatsTask: Failed to run stats task
> java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
>  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:622)
>  at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.constructColumnStatsFromPackedRows(ColStatsProcessor.java:105)
>  at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.persistColumnStats(ColStatsProcessor.java:200)
>  at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.process(ColStatsProcessor.java:93)
>  at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
>  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
>  at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
>  at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:83)
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
>  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
>  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
>  at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
>  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
>  at org.apache.hadoop.hive.ql.exec.FetchOperator.generateWrappedSplits(FetchOperator.java:435)
>  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:402)
>  at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:306)
>  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560)
>  ... 7 more
> 
> On Tue, 1 Mar 2022, Alessandro Solimando wrote:
> 
>> Hi Sungwoo,
>> the last time I tried to run a TPC-DS-based benchmark I stumbled upon a
>> similar situation. In the end I found that statistics had not been
>> computed, so CBO was not kicking in, and the automatic retry ran with CBO
>> off, which failed for something like 10 queries (subqueries that cannot be
>> decorrelated, but also some runtime errors).
>> 
>> Making sure that (column) statistics were correctly computed fixed the
>> problem.
>> 
>> Can you check if this is the case for you?
>> 
>> HTH,
>> Alessandro
>> 
>> On Tue, 1 Mar 2022 at 15:28, POSTECH CT <c...@pl.postech.ac.kr> wrote:
>> 
>>> Hello Hive team,
>>> 
>>> I wonder if anyone in the Hive team has tried the TPC-DS benchmark on
>>> the master branch recently.  We occasionally run TPC-DS system tests
>>> using the master branch, and the tests don't succeed completely. Here
>>> is how our TPC-DS tests proceed.
>>> 
>>> 1. Compile and run Hive on Tez (not Hive-LLAP)
>>> 2. Load ORC tables from 1TB TPC-DS raw text data, and compute statistics
>>> 3. Run 99 TPC-DS queries which were slightly modified to return
>>> a varying number of rows (rather than 100 rows)
>>> 4. Compare the results against the previous results
>>> 
>>> The previous results were obtained and cross-checked by running Hive
>>> 3.1.2 and SparkSQL 2.3/3.2, so we are fairly confident about their
>>> correctness.
>>> 
>>> For the latest commit in the master branch, step 2 fails. For earlier
>>> commits (for example, commits in February 2021), step 3 fails where
>>> several queries either fail or return wrong results.
>>> 
>>> We can compile and report the test results in this mailing list, but
>>> would like to know if similar results have been reproduced by the Hive
>>> team, in order to make sure that we did not make errors in our tests.
>>> 
>>> If it is okay to open a JIRA ticket that only reports failures in the
>>> TPC-DS test, we could also perform git bisect to locate the commit
>>> that began to generate wrong results.
>>> 
>>> --- Sungwoo Park
>>> 
>>> On Tue, 1 Mar 2022, Zoltan Haindrich wrote:
>>> 
>>>> Hey,
>>>> 
>>>> Great to hear that we are on the same side regarding these things :)
>>>> 
>>>> For around a week now - we have nightly builds for the master branch:
>>>> http://ci.hive.apache.org/job/hive-nightly/12/
>>>> 
>>>> I think we have 1 blocker issue:
>>>> https://issues.apache.org/jira/browse/HIVE-25665
>>>> 
>>>> I know about one more thing I would rather get fixed before we
>>>> release it:
>>>> https://issues.apache.org/jira/browse/HIVE-25994
>>>> The best would be to introduce smoke tests (HIVE-22302) to ensure that
>>>> something like this will not happen in the future - but we should
>>>> probably start moving forward.
>>>> 
>>>> I think we could call the first iteration of this "4.0.0-alpha-1" :)
>>>> 
>>>> I've added 4.0.0-alpha-1 as a version - and added the above two
>>>> tickets to it.
>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20fixVersion%20%3D%204.0.0-alpha-1
>>>> 
>>>> Are there any more things you guys know which would be needed?
>>>> 
>>>> cheers,
>>>> Zoltan
>>>> 
>>>> 
>>>> On 2/22/22 12:18 PM, Peter Vary wrote:
>>>>> I would vote for 4.0.0-alpha-1 or similar for all of the components.
>>>>> 
>>>>> When we have more stable releases I would keep the 4.x.x scheme, since
>>>>> everyone is familiar with it, and I do not see a really good reason to
>>>>> change it.
>>>>> 
>>>>> Thanks,
>>>>> Peter
>>>>> 
>>>>> 
>>>>>> On 2022. Feb 10., at 3:34, Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>> 
>>>>>> +1 that would be awesome to see Hive master released after so long.
>>>>>> 
>>>>>> Either 4.0 or 4.0.0-alpha-1 makes sense to me, not sure how we would
>>>>>> pick any 3.x or calendar date (which could tend to slip and be more
>>>>>> confusing?).
>>>>>> 
>>>>>> Thanks in any case for getting the ball rolling.
>>>>>> Szehon
>>>>>> 
>>>>>> On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich <k...@rxd.hu> wrote:
>>>>>> 
>>>>>>> Hey,
>>>>>>> 
>>>>>>> Thank you guys for chiming in; versioning is for sure something we
>>>>>>> should get to some common ground on.
>>>>>>> It's a threefold problem right now; I think we have the following things:
>>>>>>> * storage-api
>>>>>>> ** we have "2.7.3-SNAPSHOT" in the repo
>>>>>>> *** https://github.com/apache/hive/blob/0d1cffffc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
>>>>>>> ** meanwhile we already have 2.8.1 released to maven central
>>>>>>> *** https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
>>>>>>> * standalone-metastore
>>>>>>> ** 4.0.0-SNAPSHOT in the repo
>>>>>>> ** last release is 3.1.2
>>>>>>> * hive
>>>>>>> ** 4.0.0-SNAPSHOT in the repo
>>>>>>> ** last release is 3.1.2
>>>>>>> 
>>>>>>> Regarding the actual version number I'm not entirely sure where we
>>>>>>> should start the numbering - that's why I was referring to it as
>>>>>>> Hive-X in my first letter.
>>>>>>> 
>>>>>>> I think the key point here would be to start shipping releases
>>>>>>> regularly and not the actual version number we will use - I'm kinda
>>>>>>> open to any versioning scheme which reflects that this is a newer
>>>>>>> release than 3.1.2.
>>>>>>> 
>>>>>>> I could imagine the following ones:
>>>>>>> (A) start with something less expected; but keep 3 in the prefix to
>>>>>>> reflect that this is not yet 4.0
>>>>>>>     I can imagine the following numbers:
>>>>>>>     3.900.0, 3.901.0, ...
>>>>>>>     3.9.0, 3.9.1, ...
>>>>>>> (B) start 4.0.0
>>>>>>>     4.0.0, 4.1.0, ...
>>>>>>> (C) jump to some calendar based version number like 2022.2.9
>>>>>>>     trunk based development has pros and cons...making a move like
>>>>>>>     this irreversibly pledges trunk based development; and makes
>>>>>>>     release branches hard to introduce
>>>>>>> (X) somewhat orthogonal is to (also) use some suffixes
>>>>>>>     4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
>>>>>>>     this is probably the most tempting to use - but this versioning
>>>>>>>     scheme with a non-changing MINOR and PATCH number will
>>>>>>>     also suggest that the actual software is fully compatible - and
>>>>>>>     only bugs are being fixed - which will not be true...
>>>>>>> 
>>>>>>> I really like the idea to suffix these releases with alpha or beta -
>>>>>>> which will communicate our level of commitment: that these are not
>>>>>>> 100% production ready artifacts.
>>>>>>> 
>>>>>>> I think we could fix HIVE-25665; and probably experiment with
>>>>>>> 4.0.0-alpha1 for a start...
>>>>>>> 
>>>>>>>> This also means there should *not* be a branch-4 after releasing Hive 4.0
>>>>>>>> and let that diverge (and becomes the next, super-ignored branch-3),
>>>>>>> correct; no need to keep a branch we don't maintain...but in any case
>>>>>>> I think we can postpone this decision until there is something to
>>>>>>> release... :)
>>>>>>> 
>>>>>>> cheers,
>>>>>>> Zoltan
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 2/9/22 10:23 AM, László Bodor wrote:
>>>>>>>> Hi All!
>>>>>>>> 
>>>>>>>> A purely technical question: what will the SNAPSHOT version become
>>>>>>>> after releasing Hive 4.0.0? I think this is important, as it defines
>>>>>>>> and reflects the future release plans.
>>>>>>>> 
>>>>>>>> Currently, it's 4.0.0-SNAPSHOT, I guess it has been so since Hive
>>>>>>>> 3.0 + branch-3.
>>>>>>>> Hive is an evolving and super-active project: if we want to make
>>>>>>>> regular releases, we should simply release Hive 4.0 and bump pom to
>>>>>>>> 4.1.0-SNAPSHOT, which clearly says that we can release Hive 4.1
>>>>>>>> anytime we want, without being frustrated about "whether we included
>>>>>>>> enough cool stuff to release 5.0".
>>>>>>>> 
>>>>>>>> This also means there should *not* be a branch-4 after releasing
>>>>>>>> Hive 4.0 and let that diverge (and becomes the next, super-ignored
>>>>>>>> branch-3), only when we end up bringing a minor
>>>>>>>> backward-incompatible thing that needs a 4.0.x, and when it happens,
>>>>>>>> we'll create *branch-4.0* on demand. For me, a branch called
>>>>>>>> *branch-4.0* doesn't imply either that I can expect cool releases in
>>>>>>>> the future from there or that the branch is maintained and tries to
>>>>>>>> be in sync with the *master*.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Laszlo Bodor
>>>>>>>> 
>>>>>>>> Alessandro Solimando <alessandro.solima...@gmail.com> wrote
>>>>>>>> (on Tue, 8 Feb 2022, 16:42):
>>>>>>>> 
>>>>>>>>> Hello everyone,
>>>>>>>>> thank you for starting this discussion.
>>>>>>>>> 
>>>>>>>>> I agree that releasing the master branch regularly and sufficiently
>>>>>>>>> often is welcome and vital for the health of the community.
>>>>>>>>> 
>>>>>>>>> It would be great to hear from others too, especially PMC members
>>>>>>>>> and committers, but even simple contributors/followers like myself.
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> Alessandro
>>>>>>>>> 
>>>>>>>>> On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis <zabe...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hello,
>>>>>>>>>> 
>>>>>>>>>> Thanks for starting the discussion Zoltan.
>>>>>>>>>> 
>>>>>>>>>> I strongly believe that it is important to have regular and
>>>>>>>>>> frequent releases, otherwise people will create and maintain
>>>>>>>>>> separate Hive forks.
>>>>>>>>>> The latter is not good for the project and the community may lose
>>>>>>>>>> valuable members because of it.
>>>>>>>>>> 
>>>>>>>>>> Going forward I fully agree that there is no point bringing up
>>>>>>>>>> strong blockers for the next release. For sure there are many
>>>>>>>>>> backward incompatible changes and possibly unstable features but
>>>>>>>>>> unless we get a release out it will be difficult to determine what
>>>>>>>>>> is broken and what needs to be fixed.
>>>>>>>>>> 
>>>>>>>>>> Due to the big number of changes that are going to appear in the
>>>>>>>>>> next version I would suggest using the terms Hive X-alpha, Hive
>>>>>>>>>> X-beta for the first few releases. This will make it clear to the
>>>>>>>>>> end users that they need to be careful when upgrading from an older
>>>>>>>>>> version and it will give us a bit more time and freedom to treat
>>>>>>>>>> issues that the users will likely discover.
>>>>>>>>>> 
>>>>>>>>>> The only real blocker that we may want to treat is HIVE-25665 [1]
>>>>>>>>>> but we can continue the discussion under that ticket and
>>>>>>>>>> re-evaluate if necessary.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Stamatis
>>>>>>>>>> 
>>>>>>>>>> [1] https://issues.apache.org/jira/browse/HIVE-25665
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich <k...@rxd.hu> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hey All,
>>>>>>>>>>> 
>>>>>>>>>>> We haven't made a release for a long time now (3.1.2 was released
>>>>>>>>>>> on 26 August 2019) - and I think because we didn't make that many
>>>>>>>>>>> branch-3 releases, not too many fixes were ported there - which
>>>>>>>>>>> made that release branch kinda erode away.
>>>>>>>>>>> 
>>>>>>>>>>> We have a lot of new features/changes in the current master.
>>>>>>>>>>> I think instead of aiming for big feature-packed releases we
>>>>>>>>>>> should aim for making a regular release every few months - we
>>>>>>>>>>> should make regular releases which people could install and use.
>>>>>>>>>>> After all, releasing Hive after more than 2 years would be a big
>>>>>>>>>>> step forward in itself - we have so many improvements that I can't
>>>>>>>>>>> even count...
>>>>>>>>>>> 
>>>>>>>>>>> But I may not know every aspect of the project / state of some
>>>>>>>>>>> internal features - so I would like to ask you:
>>>>>>>>>>> What would be the bare minimum requirements before we could
>>>>>>>>>>> release the current master as Hive X?
>>>>>>>>>>> 
>>>>>>>>>>> There are many nice-to-have-s like:
>>>>>>>>>>> * hadoop upgrade
>>>>>>>>>>> * jdk11
>>>>>>>>>>> * remove HoS or MR
>>>>>>>>>>> * ?
>>>>>>>>>>> but I don't think these are blockers...we can make any of these
>>>>>>>>>>> in the next release if we start making them...
>>>>>>>>>>> 
>>>>>>>>>>> cheers,
>>>>>>>>>>> Zoltan
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>>
