Hi Team,
Could we create tickets for the issues?
I think it would be good to collect the issues/potential blockers in Jira instead of in a complicated mail thread.
If we set the target version to 4.0.0-alpha-1, then we can easily use the
following filter to see the status of the tasks:
https://issues.apache.org/jira/issues/?jql=project%3D%22HIVE%22%20AND%20%22Target%20Version%2Fs%22%3D%224.0.0-alpha-1%22
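In plain JQL, that filter is simply:

    project = "HIVE" AND "Target Version/s" = "4.0.0-alpha-1"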
@Stamatis: Sadly, I missed your mail/jira and created my own with the fix for building from the src package: https://issues.apache.org/jira/browse/HIVE-25997
If you have time, I would like to ask you to review it.
If anyone knows of any blocker, I would like to ask them to create a jira for it too.
Thanks,
Peter
On 2022. Mar 2., at 7:04, Sungwoo Park <c...@pl.postech.ac.kr> wrote:
Hello Alessandro,
For the latest commit, loading ORC tables fails (with the log message shown below). Let me try to find the commit that introduced this bug and create a JIRA ticket.
--- Sungwoo
2022-03-02 05:41:56,578 ERROR [Thread-73] exec.StatsTask: Failed to run stats task
java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:622)
        at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.constructColumnStatsFromPackedRows(ColStatsProcessor.java:105)
        at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.persistColumnStats(ColStatsProcessor.java:200)
        at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.process(ColStatsProcessor.java:93)
        at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:83)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.generateWrappedSplits(FetchOperator.java:435)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:402)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:306)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560)
        ... 7 more
On Tue, 1 Mar 2022, Alessandro Solimando wrote:
Hi Sungwoo,
the last time I tried to run a TPC-DS-based benchmark I stumbled upon a similar situation: in the end I found that statistics were not computed, so CBO was not kicking in, and the automatic retry with CBO off was failing for something like 10 queries (subqueries could not be decorrelated, but there were also some runtime errors).
Making sure that (column) statistics were correctly computed fixed the problem.
Can you check if this is the case for you?
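For example, something along these lines (table and column names are just placeholders for your schema):

    -- recompute table and column statistics for one of the loaded tables
    ANALYZE TABLE store_sales COMPUTE STATISTICS;
    ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;
    -- column stats (min/max/NDV/null count) should show up here once computed
    DESCRIBE FORMATTED store_sales ss_sold_date_sk;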
HTH,
Alessandro
On Tue, 1 Mar 2022 at 15:28, POSTECH CT <c...@pl.postech.ac.kr> wrote:
Hello Hive team,
I wonder if anyone in the Hive team has tried the TPC-DS benchmark on
the master branch recently. We occasionally run TPC-DS system tests
using the master branch, and the tests don't succeed completely. Here
is how our TPC-DS tests proceed.
1. Compile and run Hive on Tez (not Hive-LLAP)
2. Load ORC tables from 1TB TPC-DS raw text data, and compute statistics (a minimal sketch of this step follows the list)
3. Run 99 TPC-DS queries which were slightly modified to return a varying number of rows (rather than 100 rows)
4. Compare the results against the previous results
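Roughly, step 2 does something like the following for each table (the names below are only illustrative; assume an external text table pointing at the raw TPC-DS data):

    -- load one ORC table from the raw text data and compute its statistics
    CREATE TABLE store_sales STORED AS ORC
      AS SELECT * FROM tpcds_text.store_sales;
    ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;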
The previous results were obtained and cross-checked by running Hive 3.1.2 and SparkSQL 2.3/3.2, so we are fairly confident about their correctness.
For the latest commit in the master branch, step 2 fails. For earlier commits (for example, commits from February 2021), step 3 fails: several queries either fail or return wrong results.
We can compile and report the test results on this mailing list, but we would like to know if similar results have been reproduced by the Hive team, in order to make sure that we did not make errors in our tests.
If it is okay to open a JIRA ticket that only reports failures in the TPC-DS test, we could also perform a git bisect to locate the commit that began to generate wrong results.
--- Sungwoo Park
On Tue, 1 Mar 2022, Zoltan Haindrich wrote:
Hey,
Great to hear that we are on the same side regarding these things :)
For around a week now we have had nightly builds for the master branch:
http://ci.hive.apache.org/job/hive-nightly/12/
I think we have 1 blocker issue:
https://issues.apache.org/jira/browse/HIVE-25665
I know about one more thing I would rather get fixed before we release it:
https://issues.apache.org/jira/browse/HIVE-25994
The best would be to introduce smoke tests (HIVE-22302) to ensure that something like this will not happen in the future - but we should probably start moving forward.
I think we could call the first iteration of this "4.0.0-alpha-1" :)
I've added 4.0.0-alpha-1 as a version - and added the above two tickets to it.
https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20fixVersion%20%3D%204.0.0-alpha-1
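(decoded, that filter is just:

    project = HIVE AND fixVersion = 4.0.0-alpha-1
)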
Are there any more things you guys know of that would be needed?
cheers,
Zoltan
On 2/22/22 12:18 PM, Peter Vary wrote:
I would vote for 4.0.0-alpha-1 or similar for all of the components.
When we have more stable releases I would keep the 4.x.x schema, since everyone is familiar with it, and I do not see a really good reason to change it.
Thanks,
Peter
On 2022. Feb 10., at 3:34, Szehon Ho <szehon.apa...@gmail.com> wrote:
+1, it would be awesome to see Hive master released after so long.
Either 4.0 or 4.0.0-alpha-1 makes sense to me; I'm not sure how we would pick any 3.x or calendar date (which could tend to slip and be more confusing?).
Thanks in any case for getting the ball rolling.
Szehon
On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich <k...@rxd.hu> wrote:
Hey,
Thank you guys for chiming in; versioning is for sure something on which we should get to some common ground.
It's a triple problem right now; I think we have the following things:
* storage-api
** we have "2.7.3-SNAPSHOT" in the repo
***
https://github.com/apache/hive/blob/0d1cffffc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
** meanwhile we already have 2.8.1 released to maven central
***
https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
* standalone-metastore
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2
* hive
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2
Regarding the actual version number, I'm not entirely sure where we should start the numbering - that's why I was referring to it as Hive-X in my first letter.
I think the key point here is to start shipping releases regularly, not the actual version number we will use - I'm kind of open to any versioning scheme which reflects that this is a newer release than 3.1.2.
I could imagine the following ones:
(A) start with something less expected, but keep 3 in the prefix to reflect that this is not yet 4.0
I can imagine the following numbers:
3.900.0, 3.901.0, ...
3.9.0, 3.9.1, ...
(B) start with 4.0.0
4.0.0, 4.1.0, ...
(C) jump to some calendar-based version number like 2022.2.9
trunk-based development has pros and cons... making a move like this irreversibly pledges trunk-based development, and makes release branches hard to introduce
(X) somewhat orthogonal is to (also) use some suffixes
4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
this is probably the most tempting to use - but a versioning schema with a non-changing MINOR and PATCH number will also suggest that the actual software is fully compatible and that only bugs are being fixed - which will not be true...
I really like the idea of suffixing these releases with alpha or beta - which will communicate our level of commitment, i.e. that these are not 100% production-ready artifacts.
I think we could fix HIVE-25665, and probably experiment with 4.0.0-alpha1 for a start...
> This also means there should *not* be a branch-4 after releasing Hive 4.0
> and let that diverge (and becomes the next, super-ignored branch-3),

correct; no need to keep a branch we don't maintain... but in any case I think we can postpone this decision until there will be something to release... :)
cheers,
Zoltan
On 2/9/22 10:23 AM, László Bodor wrote:
Hi All!
A purely technical question: what will the SNAPSHOT version become after releasing Hive 4.0.0? I think this is important, as it defines and reflects the future release plans.
Currently it's 4.0.0-SNAPSHOT; I guess it has been since Hive 3.0 + branch-3.
Hive is an evolving and super-active project: if we want to make regular releases, we should simply release Hive 4.0 and bump pom to 4.1.0-SNAPSHOT, which clearly says that we can release Hive 4.1 anytime we want, without being frustrated about "whether we included enough cool stuff to release 5.0".
This also means there should *not* be a branch-4 after releasing Hive 4.0 and let that diverge (and becomes the next, super-ignored branch-3); only when we end up bringing a minor backward-incompatible thing that needs a 4.0.x, and when that happens, we'll create *branch-4.0* on demand.
For me, a branch called *branch-4.0* doesn't imply either that I can expect cool releases in the future from there or that the branch is maintained and tries to be in sync with *master*.
Regards,
Laszlo Bodor
Alessandro Solimando <alessandro.solima...@gmail.com> wrote (on 8 Feb 2022, at 16:42):
Hello everyone,
thank you for starting this discussion.
I agree that releasing the master branch regularly and sufficiently often is welcome and vital for the health of the community.
It would be great to hear from others too, especially PMC members and committers, but also from simple contributors/followers like myself.
Best regards,
Alessandro
On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis <zabe...@gmail.com> wrote:
Hello,
Thanks for starting the discussion Zoltan.
I strongly believe that it is important to have regular and frequent releases, otherwise people will create and maintain separate Hive forks.
The latter is not good for the project, and the community may lose valuable members because of it.
Going forward I fully agree that there is no point bringing up strong blockers for the next release. For sure there are many backward incompatible changes and possibly unstable features, but unless we get a release out it will be difficult to determine what is broken and what needs to be fixed.
Due to the large number of changes that are going to appear in the next version, I would suggest using the terms Hive X-alpha and Hive X-beta for the first few releases. This will make it clear to end users that they need to be careful when upgrading from an older version, and it will give us a bit more time and freedom to treat issues that users will likely discover.
The only real blocker that we may want to treat is HIVE-25665 [1], but we can continue the discussion under that ticket and re-evaluate if necessary.
Best,
Stamatis
[1] https://issues.apache.org/jira/browse/HIVE-25665
On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich <k...@rxd.hu> wrote:
Hey All,
We haven't made a release for a long time now (3.1.2 was released on 26 August 2019) - and I think because we didn't make that many branch-3 releases, not too many fixes were ported there - which made that release branch kind of erode away.
We have a lot of new features/changes in the current master.
I think instead of aiming for big feature-packed releases we should aim for making a regular release every few months - we should make regular releases which people could install and use.
After all, releasing Hive after more than 2 years would be a big step forward in itself - we have so many improvements that I can't even count them...
But I may not know every aspect of the project or the state of some internal features - so I would like to ask you:
What would be the bare minimum requirements before we could release the current master as Hive X?
There are many nice-to-haves like:
* hadoop upgrade
* jdk11
* remove HoS or MR
* ?
but I don't think these are blockers... we can do any of these in the next release if we start making them...
cheers,
Zoltan