Hi Team,
Could we create tickets for the issues?
I think it would be good to collect the issues/potential blockers in Jira instead of in a complicated mail thread.
If we set the target version to 4.0.0-alpha-1, then we can easily use the
following filter to see the status of the tasks:
https://issues.apache.org/jira/issues/?jql=project%3D%22HIVE%22%20AND%20%22Target%20Version%2Fs%22%3D%224.0.0-alpha-1%22
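In plain JQL, that filter is simply:

    project = "HIVE" AND "Target Version/s" = "4.0.0-alpha-1"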
@Stamatis: Sadly, I missed your mail/jira and created my own with the fix for building from the src package: https://issues.apache.org/jira/browse/HIVE-25997
If you have time, I would like to ask you to review it.
If anyone knows of any blocker, I would like to ask them to create a jira for it too.
Thanks,
Peter
On 2022. Mar 2., at 7:04, Sungwoo Park <c...@pl.postech.ac.kr> wrote:
Hello Alessandro,
For the latest commit, loading ORC tables fails (with the log message shown below). Let me try to find the commit that introduced this bug and create a JIRA ticket.
--- Sungwoo
2022-03-02 05:41:56,578 ERROR [Thread-73] exec.StatsTask: Failed to run stats task
java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:622)
        at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.constructColumnStatsFromPackedRows(ColStatsProcessor.java:105)
        at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.persistColumnStats(ColStatsProcessor.java:200)
        at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.process(ColStatsProcessor.java:93)
        at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:83)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.generateWrappedSplits(FetchOperator.java:435)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:402)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:306)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560)
        ... 7 more
On Tue, 1 Mar 2022, Alessandro Solimando wrote:
Hi Sungwoo,
the last time I tried to run a TPC-DS-based benchmark I stumbled upon a similar situation: in the end I found that statistics were not computed, so CBO was not kicking in, and the automatic retry with CBO off was failing for something like 10 queries (subqueries could not be decorrelated, but there were also some runtime errors).
Making sure that (column) statistics were correctly computed fixed the problem.
Can you check if this is the case for you?
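For example, something along these lines (table and column names are just placeholders for your schema):

    -- recompute table and column statistics for one of the loaded tables
    ANALYZE TABLE store_sales COMPUTE STATISTICS;
    ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;
    -- column stats (min/max/NDV/null count) should show up here once computed
    DESCRIBE FORMATTED store_sales ss_sold_date_sk;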
HTH,
Alessandro
On Tue, 1 Mar 2022 at 15:28, POSTECH CT <c...@pl.postech.ac.kr> wrote:
Hello Hive team,
I wonder if anyone in the Hive team has tried the TPC-DS benchmark on
the master branch recently. We occasionally run TPC-DS system tests
using the master branch, and the tests don't succeed completely. Here
is how our TPC-DS tests proceed.
1. Compile and run Hive on Tez (not Hive-LLAP)
2. Load ORC tables from 1TB TPC-DS raw text data, and compute statistics (a minimal sketch of this step follows the list)
3. Run 99 TPC-DS queries which were slightly modified to return a varying number of rows (rather than 100 rows)
4. Compare the results against the previous results
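Roughly, step 2 does something like the following for each table (the names below are only illustrative; assume an external text table pointing at the raw TPC-DS data):

    -- load one ORC table from the raw text data and compute its statistics
    CREATE TABLE store_sales STORED AS ORC
      AS SELECT * FROM tpcds_text.store_sales;
    ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;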
The previous results were obtained and cross-checked by running Hive 3.1.2 and SparkSQL 2.3/3.2, so we are fairly confident about their correctness.
For the latest commit in the master branch, step 2 fails. For earlier commits (for example, commits from February 2021), step 3 fails: several queries either fail or return wrong results.
We can compile and report the test results on this mailing list, but we would like to know if similar results have been reproduced by the Hive team, in order to make sure that we did not make errors in our tests.
If it is okay to open a JIRA ticket that only reports failures in the TPC-DS test, we could also perform a git bisect to locate the commit that began to generate wrong results.
--- Sungwoo Park
On Tue, 1 Mar 2022, Zoltan Haindrich wrote:
Hey,
Great to hear that we are on the same side regarding these things :)
For around a week now we have had nightly builds for the master branch:
http://ci.hive.apache.org/job/hive-nightly/12/
I think we have 1 blocker issue:
https://issues.apache.org/jira/browse/HIVE-25665
I know about one more thing I would rather get fixed before we release it:
https://issues.apache.org/jira/browse/HIVE-25994
The best would be to introduce smoke tests (HIVE-22302) to ensure that something like this will not happen in the future - but we should probably start moving forward.
I think we could call the first iteration of this "4.0.0-alpha-1" :)
I've added 4.0.0-alpha-1 as a version - and added the above two tickets to it.
https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20fixVersion%20%3D%204.0.0-alpha-1
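(decoded, that filter is just:

    project = HIVE AND fixVersion = 4.0.0-alpha-1
)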
Are there any more things you guys know of that would be needed?
cheers,
Zoltan
On 2/22/22 12:18 PM, Peter Vary wrote:
I would vote for 4.0.0-alpha-1 or similar for all of the components.
When we have more stable releases I would keep the 4.x.x schema, since everyone is familiar with it, and I do not see a really good reason to change it.
Thanks,
Peter
On 2022. Feb 10., at 3:34, Szehon Ho <szehon.apa...@gmail.com> wrote:
+1, it would be awesome to see Hive master released after so long.
Either 4.0 or 4.0.0-alpha-1 makes sense to me; I'm not sure how we would pick any 3.x or calendar date (which could tend to slip and be more confusing?).
Thanks in any case for getting the ball rolling.
Szehon
On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich <k...@rxd.hu> wrote:
Hey,
Thank you guys for chiming in; versioning is for sure something on which we should get to some common ground.
It's a triple problem right now; I think we have the following things:
* storage-api
** we have "2.7.3-SNAPSHOT" in the repo
***
https://github.com/apache/hive/blob/0d1cffffc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
** meanwhile we already have 2.8.1 released to maven central
***
https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
* standalone-metastore
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2
* hive
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2
Regarding the actual version number, I'm not entirely sure where we should start the numbering - that's why I was referring to it as Hive-X in my first letter.
I think the key point here is to start shipping releases regularly, not the actual version number we will use - I'm kind of open to any versioning scheme which reflects that this is a newer release than 3.1.2.
I could imagine the following ones:
(A) start with something less expected, but keep 3 in the prefix to reflect that this is not yet 4.0
I can imagine the following numbers:
3.900.0, 3.901.0, ...
3.9.0, 3.9.1, ...
(B) start with 4.0.0
4.0.0, 4.1.0, ...
(C) jump to some calendar-based version number like 2022.2.9
trunk-based development has pros and cons... making a move like this irreversibly pledges trunk-based development, and makes release branches hard to introduce
(X) somewhat orthogonal is to (also) use some suffixes
4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
this is probably the most tempting to use - but a versioning schema with a non-changing MINOR and PATCH number will also suggest that the actual software is fully compatible and that only bugs are being fixed - which will not be true...
I really like the idea of suffixing these releases with alpha or beta - which will communicate our level of commitment, i.e. that these are not 100% production-ready artifacts.
I think we could fix HIVE-25665, and probably experiment with 4.0.0-alpha1 for a start...
> This also means there should *not* be a branch-4 after releasing Hive 4.0
> and let that diverge (and becomes the next, super-ignored branch-3),

correct; no need to keep a branch we don't maintain... but in any case I think we can postpone this decision until there will be something to release... :)
cheers,
Zoltan
On 2/9/22 10:23 AM, László Bodor wrote:
Hi All!
A purely technical question: what will the SNAPSHOT version become after releasing Hive 4.0.0? I think this is important, as it defines and reflects the future release plans.
Currently it's 4.0.0-SNAPSHOT; I guess it has been since Hive 3.0 + branch-3.
Hive is an evolving and super-active project: if we want to make regular releases, we should simply release Hive 4.0 and bump pom to 4.1.0-SNAPSHOT, which clearly says that we can release Hive 4.1 anytime we want, without being frustrated about "whether we included enough cool stuff to release 5.0".
This also means there should *not* be a branch-4 after releasing Hive 4.0 and let that diverge (and becomes the next, super-ignored branch-3); only when we end up bringing a minor backward-incompatible thing that needs a 4.0.x, and when that happens, we'll create *branch-4.0* on demand.
For me, a branch called *branch-4.0* doesn't imply either that I can expect cool releases in the future from there or that the branch is maintained and tries to be in sync with *master*.
Regards,
Laszlo Bodor
Alessandro Solimando <alessandro.solima...@gmail.com> wrote (on 8 Feb 2022, at 16:42):
Hello everyone,
thank you for starting this discussion.
I agree that releasing the master branch regularly and sufficiently often is welcome and vital for the health of the community.
It would be great to hear from others too, especially PMC members and committers, but also from simple contributors/followers like myself.
Best regards,
Alessandro
On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis <zabe...@gmail.com> wrote:
Hello,
Thanks for starting the discussion Zoltan.
I strongly believe that it is important to have regular and frequent releases, otherwise people will create and maintain separate Hive forks.
The latter is not good for the project, and the community may lose valuable members because of it.
Going forward I fully agree that there is no point bringing up strong blockers for the next release. For sure there are many backward incompatible changes and possibly unstable features, but unless we get a release out it will be difficult to determine what is broken and what needs to be fixed.
Due to the large number of changes that are going to appear in the next version, I would suggest using the terms Hive X-alpha and Hive X-beta for the first few releases. This will make it clear to end users that they need to be careful when upgrading from an older version, and it will give us a bit more time and freedom to treat issues that users will likely discover.
The only real blocker that we may want to treat is HIVE-25665 [1], but we can continue the discussion under that ticket and re-evaluate if necessary.
Best,
Stamatis
[1] https://issues.apache.org/jira/browse/HIVE-25665
On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich <k...@rxd.hu> wrote:
Hey All,
We haven't made a release for a long time now (3.1.2 was released on 26 August 2019) - and I think because we didn't make that many branch-3 releases, not too many fixes were ported there - which made that release branch kind of erode away.
We have a lot of new features/changes in the current master.
I think instead of aiming for big feature-packed releases we should aim for making a regular release every few months - we should make regular releases which people could install and use.
After all, releasing Hive after more than 2 years would be a big step forward in itself - we have so many improvements that I can't even count them...
But I may not know every aspect of the project or the state of some internal features - so I would like to ask you:
What would be the bare minimum requirements before we could release the current master as Hive X?
There are many nice-to-haves like:
* hadoop upgrade
* jdk11
* remove HoS or MR
* ?
but I don't think these are blockers... we can do any of these in the next release if we start making them...
cheers,
Zoltan