Hey,

regarding 4.0.0 / 4.0.0-alpha-1 target/fix versions in the jira:
* I think we should change all already resolved tickets with fix version 4.0.0 
to have fix version 4.0.0-alpha-1
** this could be postponed until we are actually releasing the thing, as I think everyone committing to master is entering 4.0.0 as the fix version without much afterthought...this could probably change after we get the first release out.
* regarding the existing tickets with fix version/target version 4.0.0 - I
think that would be a bit too much (>200 tickets)
** some numbers:
*** 239 tickets open now
*** 224 were not updated in the last 90 days
*** 216 were not changed in the last 180 days
*** 178 were not updated in the last 360 days
** as a matter of fact, I think many of these tickets shouldn't even have a target or fix version - and most of them should be unassigned...I don't want to get lost in this right now...I think for now we should keep the scope small and only deal with the 4.0.0-alpha-1 tickets

https://issues.apache.org/jira/issues/?jql=project%20%3D%20hive%20and%20resolutiondate%20%20is%20empty%20and%20(fixVersion%20%20in%20(%274.0.0%27)%20or%20cf%5B12310320%5D%20%20in%20(%274.0.0%27))
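
For reference, the encoded URL above can be rebuilt programmatically; here is a minimal Python sketch where the JQL string mirrors the linked filter (cf[12310320] appears to be the target-version custom field, judging from the URL itself):

```python
from urllib.parse import urlencode

# Rebuild the JIRA search URL from a readable JQL string.
# cf[12310320] matches the custom field used in the filter above.
jql = ("project = HIVE and resolutiondate is empty "
       "and (fixVersion in ('4.0.0') or cf[12310320] in ('4.0.0'))")
url = "https://issues.apache.org/jira/issues/?" + urlencode({"jql": jql})
print(url)
```

urlencode takes care of escaping the spaces, brackets, and quotes, so the JQL stays readable in one place.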

I think for faster communication regarding these things we could also utilize 
the #hive channel on the ASF slack - what do you guys think?

cheers,
Zoltan

On 3/2/22 9:51 AM, Stamatis Zampetakis wrote:
Agree with Peter, creating JIRAs is the way to go.

Putting the appropriate priority (e.g., BLOCKER) and version (4.0.0 or
4.0.0-alpha-1) when creating the JIRA should be enough to keep us on track.
I am mentioning both 4.0.0 and 4.0.0-alpha-1 because eventually I think we
are gonna move everything with target 4.0.0 to 4.0.0-alpha-1.

On Wed, Mar 2, 2022 at 9:37 AM Peter Vary <pv...@cloudera.com.invalid>
wrote:

Hi Team,

Could we create tickets for the issues?
I think it would be good to collect the issues/potential blockers in
JIRA instead of in a complicated mail thread.

If we set the target version to 4.0.0-alpha-1, then we can easily use the
following filter to see the status of the tasks:

https://issues.apache.org/jira/issues/?jql=project%3D%22HIVE%22%20AND%20%22Target%20Version%2Fs%22%3D%224.0.0-alpha-1%22


@Stamatis: Sadly I missed your email/JIRA and created my own with the
fix for building from the src package:
https://issues.apache.org/jira/browse/HIVE-25997
If you have time, I would like to ask you to review.

If anyone knows of any blocker, I would like to ask them to create a JIRA
for that too.

Thanks,
Peter


On 2022. Mar 2., at 7:04, Sungwoo Park <c...@pl.postech.ac.kr> wrote:

Hello Alessandro,

For the latest commit, loading ORC tables fails (with the log message
shown below). Let me try to find a commit that introduces this bug and
create a JIRA ticket.

--- Sungwoo

2022-03-02 05:41:56,578 ERROR [Thread-73] exec.StatsTask: Failed to run stats task
java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:622)
  at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.constructColumnStatsFromPackedRows(ColStatsProcessor.java:105)
  at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.persistColumnStats(ColStatsProcessor.java:200)
  at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.process(ColStatsProcessor.java:93)
  at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
  at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
  at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:83)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
  at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
  at org.apache.hadoop.hive.ql.exec.FetchOperator.generateWrappedSplits(FetchOperator.java:435)
  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:402)
  at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:306)
  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560)
  ... 7 more

On Tue, 1 Mar 2022, Alessandro Solimando wrote:

Hi Sungwoo,
last time I ran a TPCDS-based benchmark I stumbled upon a similar
situation. I eventually found that statistics were not computed, so CBO
was not kicking in, and the automatic retry runs with CBO off, which was
failing for something like 10 queries (subqueries that cannot be
decorrelated, but also some runtime errors).

Making sure that (column) statistics were correctly computed fixed the
problem.

Can you check if this is the case for you?

HTH,
Alessandro

On Tue, 1 Mar 2022 at 15:28, POSTECH CT <c...@pl.postech.ac.kr> wrote:

Hello Hive team,

I wonder if anyone in the Hive team has tried the TPC-DS benchmark on
the master branch recently.  We occasionally run TPC-DS system tests
using the master branch, and the tests don't succeed completely. Here
is how our TPC-DS tests proceed.

1. Compile and run Hive on Tez (not Hive-LLAP)
2. Load ORC tables from 1TB TPC-DS raw text data, and compute statistics
3. Run 99 TPC-DS queries which were slightly modified to return a varying
number of rows (rather than 100 rows)
4. Compare the results against the previous results
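
Step 4 can be sketched as a simple result diff; the names and in-memory layout below are hypothetical stand-ins (a real harness would diff per-query result files):

```python
# Sketch of step 4: compare each query's current output against the
# previously cross-checked reference results.
def compare_results(current: dict, reference: dict) -> list:
    """Return the queries whose current results differ from the reference."""
    mismatches = []
    for query, rows in reference.items():
        if current.get(query) != rows:
            mismatches.append(query)
    return mismatches

reference = {"query1": [(1, "a")], "query2": [(2, "b")]}
current = {"query1": [(1, "a")], "query2": [(9, "x")]}
print(compare_results(current, reference))  # ['query2']
```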

The previous results were obtained and cross-checked by running Hive
3.1.2 and SparkSQL 2.3/3.2, so we are fairly confident about their
correctness.

For the latest commit in the master branch, step 2 fails. For earlier
commits (for example, commits in February 2021), step 3 fails: several
queries either fail or return wrong results.

We can compile and report the test results in this mailing list, but
would like to know if similar results have been reproduced by the Hive
team, in order to make sure that we did not make errors in our tests.

If it is okay to open a JIRA ticket that only reports failures in the
TPC-DS test, we could also perform a git bisect to locate the commit
that began to generate wrong results.
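
For context, a git bisect is just a binary search over history; the sketch below (with stand-in commit ids and a stand-in is_bad check in place of the real TPC-DS run) shows the idea git automates:

```python
# Binary-search an ordered list of commits for the first one where the
# test fails. Assumes the last commit is known bad and earlier commits
# are good; `commits` and `is_bad` are illustrative stand-ins.
def first_bad(commits, is_bad):
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]

commits = ["c1", "c2", "c3", "c4", "c5"]
print(first_bad(commits, lambda c: c >= "c4"))  # c4
```

With ~13 months of history to search, this needs only about log2(N) test runs instead of one per commit.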

--- Sungwoo Park

On Tue, 1 Mar 2022, Zoltan Haindrich wrote:

Hey,

Great to hear that we are on the same side regarding these things :)

For around a week now - we have nightly builds for the master branch:
http://ci.hive.apache.org/job/hive-nightly/12/

I think we have 1 blocker issue:
https://issues.apache.org/jira/browse/HIVE-25665

I know about one more thing I would rather get fixed before we release
it:
https://issues.apache.org/jira/browse/HIVE-25994
The best would be to introduce smoke tests (HIVE-22302) to ensure that
something like this will not happen in the future - but we should probably
start moving forward.

I think we could call the first iteration of this "4.0.0-alpha-1" :)

I've added 4.0.0-alpha-1 as a version - and added the above two tickets
to it.


https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20fixVersion%20%3D%204.0.0-alpha-1

Are there any more things you guys know which would be needed?

cheers,
Zoltan


On 2/22/22 12:18 PM, Peter Vary wrote:
I would vote for 4.0.0-alpha-1 or similar for all of the components.

When we have more stable releases I would keep the 4.x.x schema, since
everyone is familiar with it, and I do not see a really good reason to
change it.

Thanks,
Peter


On 2022. Feb 10., at 3:34, Szehon Ho <szehon.apa...@gmail.com>
wrote:

+1 that would be awesome to see Hive master released after so long.

Either 4.0 or 4.0.0-alpha-1 makes sense to me; not sure how we would pick
any 3.x or calendar date (which could tend to slip and be more confusing?).

Thanks in any case for getting the ball rolling.
Szehon

On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich <k...@rxd.hu>
wrote:

Hey,

Thank you guys for chiming in; versioning is for sure something we should
reach common ground on.
It's a three-part problem right now; I think we have the following things:
* storage-api
** we have "2.7.3-SNAPSHOT" in the repo
*** https://github.com/apache/hive/blob/0d1cffffc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
** meanwhile we already have 2.8.1 released to maven central
*** https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
* standalone-metastore
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2
* hive
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2

Regarding the actual version number I'm not entirely sure where we should
start the numbering - that's why I was referring to it as Hive-X in my
first email.

I think the key point here is to start shipping releases regularly, not
the actual version number we will use - I'm open to any versioning scheme
which reflects that this is a newer release than 3.1.2.

I could imagine the following ones:
(A) start with something less expected; but keep 3 in the prefix to
reflect that this is not yet 4.0
     I can imagine the following numbers:
     3.900.0, 3.901.0, ...
     3.9.0, 3.9.1, ...
(B) start 4.0.0
     4.0.0, 4.1.0, ...
(C) jump to some calendar based version number like 2022.2.9
     trunk-based development has pros and cons...making a move like this
irreversibly pledges trunk-based development, and makes release branches
hard to introduce
(X) somewhat orthogonal is to (also) use some suffixes
     4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
     this is probably the most tempting to use - but this versioning
scheme with a non-changing MINOR and PATCH number will also suggest that
the actual software is fully compatible - and only bugs are being fixed -
which will not be true...
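
On option (X): one practical upside of the alpha/beta suffixes is that common version-ordering rules already sort them before the final release. A toy Python sketch of that ordering (Maven's own ComparableVersion implements the real rules; this parser only handles the simple shapes discussed here):

```python
# Sort key treating "<major>.<minor>.<patch>[-<qualifier>]" versions so
# that pre-releases (alpha/beta) come before the plain release.
def version_key(v: str):
    core, _, pre = v.partition("-")
    nums = tuple(int(x) for x in core.split("."))
    # A plain release (1,) outranks any pre-release (0, qualifier).
    return nums + ((1,) if not pre else (0, pre))

versions = ["4.0.0", "4.0.0-alpha-1", "4.0.0-beta-1", "4.1.0"]
print(sorted(versions, key=version_key))
# ['4.0.0-alpha-1', '4.0.0-beta-1', '4.0.0', '4.1.0']
```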

I really like the idea of suffixing these releases with alpha or beta -
which will communicate our level of commitment: these are not 100%
production-ready artifacts.

I think we could fix HIVE-25665; and probably experiment with
4.0.0-alpha1 to start...

This also means there should *not* be a branch-4 after releasing Hive 4.0
that we let diverge (and become the next, super-ignored branch-3) -
correct; no need to keep a branch we don't maintain...but in any case I
think we can postpone this decision until there is something to
release... :)

cheers,
Zoltan



On 2/9/22 10:23 AM, László Bodor wrote:
Hi All!

A purely technical question: what will the SNAPSHOT version become after
releasing Hive 4.0.0? I think this is important, as it defines and
reflects the future release plans.

Currently it's 4.0.0-SNAPSHOT; I guess it has been since Hive 3.0 +
branch-3.
Hive is an evolving and super-active project: if we want to make regular
releases, we should simply release Hive 4.0 and bump the pom to
4.1.0-SNAPSHOT, which clearly says that we can release Hive 4.1 anytime we
want, without being frustrated about "whether we included enough cool
stuff to release 5.0".

This also means there should *not* be a branch-4 after releasing Hive 4.0
that we let diverge (and become the next, super-ignored branch-3); we
should create *branch-4.0* on demand, only when we end up bringing in a
minor backward-incompatible thing that needs a 4.0.x release. For me, a
branch called *branch-4.0* doesn't imply either that I can expect cool
releases from it in the future or that the branch is maintained and kept
in sync with *master*.

Regards,
Laszlo Bodor

Alessandro Solimando <alessandro.solima...@gmail.com> wrote (on Tue,
8 Feb 2022, 16:42):

Hello everyone,
thank you for starting this discussion.

I agree that releasing the master branch regularly and sufficiently often
is welcome and vital for the health of the community.

It would be great to hear from others too, especially PMC members and
committers, but also simple contributors/followers like myself.

Best regards,
Alessandro

On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis <zabe...@gmail.com>
wrote:

Hello,

Thanks for starting the discussion Zoltan.

I strongly believe that it is important to have regular and frequent
releases, otherwise people will create and maintain separate Hive forks.
The latter is not good for the project, and the community may lose
valuable members because of it.

Going forward, I fully agree that there is no point bringing up strong
blockers for the next release. For sure there are many backward-
incompatible changes and possibly unstable features, but unless we get a
release out it will be difficult to determine what is broken and what
needs to be fixed.

Due to the large number of changes that are going to appear in the next
version, I would suggest using the terms Hive X-alpha and Hive X-beta for
the first few releases. This will make it clear to end users that they
need to be careful when upgrading from an older version, and it will give
us a bit more time and freedom to treat issues that users will likely
discover.

The only real blocker that we may want to treat is HIVE-25665 [1], but we
can continue the discussion under that ticket and re-evaluate if
necessary.

Best,
Stamatis

[1] https://issues.apache.org/jira/browse/HIVE-25665


On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich <k...@rxd.hu>
wrote:

Hey All,

We haven't made a release in a long time now (3.1.2 was released on 26
August 2019) - and I think because we didn't make that many branch-3
releases, not too many fixes were ported there - which made that release
branch kind of erode away.

We have a lot of new features/changes in the current master.
I think instead of aiming for big feature-packed releases we should aim
for making a regular release every few months - regular releases which
people can install and use.
After all, releasing Hive after more than 2 years would be a big step
forward in itself - we have more improvements than I can count...

But I may not know every aspect of the project / the state of some
internal features - so I would like to ask you:
What would be the bare minimum requirements before we could release the
current master as Hive X?

There are many nice-to-haves like:
* hadoop upgrade
* jdk11
* remove HoS or MR
* ?
but I don't think these are blockers...we can tackle any of these in the
next release if we start working on them...

cheers,
Zoltan











