Hello Hive team,

I wonder if anyone in the Hive team has tried the TPC-DS benchmark on
the master branch recently.  We occasionally run TPC-DS system tests
using the master branch, and the tests don't succeed completely. Here
is how our TPC-DS tests proceed.

1. Compile and run Hive on Tez (not Hive-LLAP)
2. Load ORC tables from 1TB TPC-DS raw text data, and compute statistics
3. Run the 99 TPC-DS queries, slightly modified to return a varying
number of rows (rather than 100 rows)
4. Compare the results against the previous results

The previous results were obtained and cross-checked by running Hive
3.1.2 and SparkSQL 2.3/3.2, so we are fairly confident about their
correctness.
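
As an illustration of step 4, here is a minimal Python sketch of how two result sets could be compared. It is not the actual test harness used above: the tab delimiter, the function names, and the sample rows are all assumptions made for the example. Rows are compared as multisets so that row order does not matter; queries whose ORDER BY must be honored, or whose columns hold floating-point values, would need a stricter or more tolerant comparison.

```python
from collections import Counter

def parse_rows(text, delimiter="\t"):
    """Split raw query output into rows, ignoring blank lines."""
    return [tuple(line.split(delimiter)) for line in text.splitlines() if line.strip()]

def compare_results(expected_text, actual_text):
    """Compare two result sets as multisets, so row order is ignored."""
    expected = Counter(parse_rows(expected_text))
    actual = Counter(parse_rows(actual_text))
    missing = expected - actual      # rows in the reference but not in the new run
    unexpected = actual - expected   # rows in the new run but not in the reference
    return missing, unexpected

# Made-up sample data: the new run lost one duplicate row and gained a new one.
reference = "1\tAAA\n2\tBBB\n2\tBBB\n"
new_run = "2\tBBB\n1\tAAA\n3\tCCC\n"
missing, unexpected = compare_results(reference, new_run)
print("missing:", dict(missing))
print("unexpected:", dict(unexpected))
```

A query passes only when both `missing` and `unexpected` are empty.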

For the latest commit in the master branch, step 2 fails. For earlier
commits (for example, commits in February 2021), step 3 fails where
several queries either fail or return wrong results.

We can compile and report the test results in this mailing list, but
would like to know if similar results have been reproduced by the Hive
team, in order to make sure that we did not make errors in our tests.

If it is okay to open a JIRA ticket that only reports failures in the
TPC-DS test, we could also perform git bisect to locate the commit
that began to generate wrong results.
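
To make the bisection idea concrete, here is a small Python sketch of the underlying binary search. It is a simplification, not how git bisect is implemented: it assumes a linear history ordered oldest to newest, a known good first commit, a known bad last commit, and that every commit after the first bad one is also bad. The commit names and the `is_good` check are hypothetical; in practice the check would build Hive at that commit and run the affected TPC-DS queries.

```python
def first_bad_commit(commits, is_good):
    """Return the earliest commit for which is_good() is False.

    Assumes commits[0] is good, commits[-1] is bad, and that once a
    commit is bad all later commits are bad as well.
    """
    lo, hi = 0, len(commits) - 1   # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]

# Hypothetical history in which wrong results first appear at "c5".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
bad_from = history.index("c5")
tested = []

def is_good(commit):
    tested.append(commit)          # record which commits had to be tested
    return history.index(commit) < bad_from

culprit = first_bad_commit(history, is_good)
print(culprit, "found after testing", tested)
```

Because each test halves the remaining range, the number of builds grows only logarithmically with the number of commits between the last good and first bad revision.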

--- Sungwoo Park

On Tue, 1 Mar 2022, Zoltan Haindrich wrote:

Hey,

Great to hear that we are on the same side regarding these things :)

For around a week now - we have nightly builds for the master branch:
http://ci.hive.apache.org/job/hive-nightly/12/

I think we have 1 blocker issue:
https://issues.apache.org/jira/browse/HIVE-25665

I know about one more thing I would rather get fixed before we release it:
https://issues.apache.org/jira/browse/HIVE-25994
The best would be to introduce smoke tests (HIVE-22302) to ensure that something like this will not happen in the future - but we should probably start moving forward.

I think we could call the first iteration of this as "4.0.0-alpha-1" :)

I've added 4.0.0-alpha-1 as a version - and added the above two tickets to it.
https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20fixVersion%20%3D%204.0.0-alpha-1

Are there any more things you guys know which would be needed?

cheers,
Zoltan


On 2/22/22 12:18 PM, Peter Vary wrote:
I would vote for 4.0.0-alpha-1 or similar for all of the components.

When we have more stable releases I would keep the 4.x.x schema, since everyone is familiar with it, and I do not see a really good reason to change it.

Thanks,
Peter


On 2022. Feb 10., at 3:34, Szehon Ho <szehon.apa...@gmail.com> wrote:

+1 that would be awesome to see Hive master released after so long.

Either 4.0 or 4.0.0-alpha-1 makes sense to me, not sure how we would pick
any 3.x or calendar date (which could tend to slip and be more confusing?).

Thanks in any case to get the ball rolling.
Szehon

On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich <k...@rxd.hu> wrote:

Hey,

Thank you guys for chiming in; versioning is for sure something we should
get to some common ground.
It's a triple problem right now; I think we have the following things:
* storage-api
** we have "2.7.3-SNAPSHOT" in the repo
*** https://github.com/apache/hive/blob/0d1cffffc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
** meanwhile we already have 2.8.1 released to maven central
*** https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
* standalone-metastore
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2
* hive
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2
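
The mismatch in the list above can be checked mechanically. The following Python sketch uses only the version strings quoted in this message; the helper name and the rule of comparing just the numeric part (ignoring the -SNAPSHOT qualifier) are assumptions made for illustration.

```python
def numeric_version(version):
    """Turn '2.7.3-SNAPSHOT' into (2, 7, 3), dropping any '-qualifier' suffix."""
    core = version.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))

# Versions quoted in the list above: (version in the repo, last release).
components = {
    "storage-api": ("2.7.3-SNAPSHOT", "2.8.1"),
    "standalone-metastore": ("4.0.0-SNAPSHOT", "3.1.2"),
    "hive": ("4.0.0-SNAPSHOT", "3.1.2"),
}

# A component is "behind" when the repo version is not newer than the last release.
behind = [
    name
    for name, (in_repo, released) in components.items()
    if numeric_version(in_repo) <= numeric_version(released)
]
print(behind)
```

With these inputs only storage-api is flagged, since its snapshot version in the repo is older than what is already on maven central.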

Regarding the actual version number I'm not entirely sure where we should
start the numbering - that's why I was referring to it as Hive-X in my
first letter.

I think the key point here would be to start shipping releases regularly
and not the actual version number we will use - I'm kinda open to any
versioning scheme which reflects that this is a newer release than 3.1.2.

I could imagine the following ones:
(A) start with something less expected; but keep 3 in the prefix to
reflect that this is not yet 4.0
     I can imagine the following numbers:
     3.900.0, 3.901.0, ...
     3.9.0, 3.9.1, ...
(B) start 4.0.0
     4.0.0, 4.1.0, ...
(C) jump to some calendar based version number like 2022.2.9
     trunk based development has pros and cons...making a move like this
     irreversibly pledges trunk based development; and makes release
     branches hard to introduce
(X) somewhat orthogonal is to (also) use some suffixes
     4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
     this is probably the most tempting to use - but this versioning
     schema with a non-changing MINOR and PATCH number will also suggest
     that the actual software is fully compatible - and only bugs are
     being fixed - which will not be true...
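
To see how the candidate schemes would order relative to 3.1.2, here is a minimal Python sketch of a sort key. It is a simplified ordering assumed for illustration, not Maven's actual version comparison: it only understands numeric versions with an optional alpha or beta suffix, and it ranks alpha before beta before the final release.

```python
import re

QUALIFIER_RANK = {"alpha": 0, "beta": 1, "": 2}  # "" means a final release

def sort_key(version):
    """Key for versions like '4.0.0', '4.0.0-alpha1' or '4.0.0-alpha-1'."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:-([a-z]+)-?(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version}")
    core, qualifier, number = match.groups()
    numbers = tuple(int(part) for part in core.split("."))
    # Other qualifiers (rc, milestone, ...) are not handled and raise KeyError.
    return (numbers, QUALIFIER_RANK[qualifier or ""], int(number or 0))

candidates = [
    "4.0.0", "3.1.2", "4.0.0-beta1", "3.900.0",
    "4.0.0-alpha2", "4.0.0-alpha1", "4.1.0",
]
ordered = sorted(candidates, key=sort_key)
print(ordered)
```

Under this ordering every option sorts after 3.1.2, and the suffixed releases sort before the plain 4.0.0 they lead up to.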

I really like the idea to suffix these releases with alpha or beta - which
will communicate our level of commitment: that these are not 100%
production ready artifacts.

I think we could fix HIVE-25665; and probably experiment with 4.0.0-alpha1
for start...

> This also means there should *not* be a branch-4 after releasing Hive 4.0
> and let that diverge (and becomes the next, super-ignored branch-3),

correct; no need to keep a branch we don't maintain...but in any case I
think we can postpone this decision until there will be something to
release... :)

cheers,
Zoltan



On 2/9/22 10:23 AM, László Bodor wrote:
Hi All!

A purely technical question: what will the SNAPSHOT version become after
releasing Hive 4.0.0? I think this is important, as it defines and reflects
the future release plans.

Currently, it's 4.0.0-SNAPSHOT, I guess it's since Hive 3.0 + branch-3.
Hive is an evolving and super-active project: if we want to make regular
releases, we should simply release Hive 4.0 and bump pom to 4.1.0-SNAPSHOT,
which clearly says that we can release Hive 4.1 anytime we want, without
being frustrated about "whether we included enough cool stuff to release
5.0".

This also means there should *not* be a branch-4 after releasing Hive 4.0
and let that diverge (and becomes the next, super-ignored branch-3), only
when we end up bringing a minor backward-incompatible thing that needs a
4.0.x, and when it happens, we'll create *branch-4.0* on demand. For me, a
branch called *branch-4.0* doesn't imply either I can expect cool releases
in the future from there or the branch is maintained and tries to be in
sync with the *master*.

Regards,
Laszlo Bodor

Alessandro Solimando <alessandro.solima...@gmail.com> wrote (on Tue,
8 Feb 2022, 16:42):

Hello everyone,
thank you for starting this discussion.

I agree that releasing the master branch regularly and sufficiently often
is welcome and vital for the health of the community.

It would be great to hear from others too, especially PMC members and
committers, but even simple contributors/followers as myself.

Best regards,
Alessandro

On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis <zabe...@gmail.com>
wrote:

Hello,

Thanks for starting the discussion Zoltan.

I strongly believe that it is important to have regular and frequent
releases; otherwise people will create and maintain separate Hive forks.
The latter is not good for the project, and the community may lose valuable
members because of it.

Going forward I fully agree that there is no point bringing up strong
blockers for the next release. For sure there are many backward
incompatible changes and possibly unstable features but unless we get a
release out it will be difficult to determine what is broken and what needs
to be fixed.

Due to the large number of changes that are going to appear in the next
version I would suggest using the terms Hive X-alpha, Hive X-beta for the
first few releases. This will make it clear to the end users that they need
to be careful when upgrading from an older version and it will give us a
bit more time and freedom to treat issues that the users will likely
discover.

The only real blocker that we may want to treat is HIVE-25665 [1] but we
can continue the discussion under that ticket and re-evaluate if necessary.

Best,
Stamatis

[1] https://issues.apache.org/jira/browse/HIVE-25665


On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich <k...@rxd.hu> wrote:

Hey All,

We haven't made a release for a long time now (3.1.2 was released on 26
August 2019), and I think because we didn't make that many branch-3
releases, not too many fixes were ported there, which made that release
branch kinda erode away.

We have a lot of new features/changes in the current master.
I think instead of aiming for big feature-packed releases we should aim
for making a regular release every few months - we should make regular
releases which people could install and use.
After all, releasing Hive after more than 2 years would be a big step
forward in itself - we have so many improvements that I can't even
count...

But I may not know every aspect of the project or the state of some
internal features - so I would like to ask you:
What would be the bare minimum requirements before we could release the
current master as Hive X?

There are many nice-to-have-s like:
* hadoop upgrade
* jdk11
* remove HoS or MR
* ?
but I don't think these are blockers...we can make any of these in the
next release if we start making them...

cheers,
Zoltan






