Hi Team,

Could we create tickets for the issues? I think it would be good to collect the issues/potential blockers in Jira instead of having a complicated mail thread.
If we set the target version to 4.0.0-alpha-1, then we can easily use the following filter to see the status of the tasks:
https://issues.apache.org/jira/issues/?jql=project%3D%22HIVE%22%20AND%20%22Target%20Version%2Fs%22%3D%224.0.0-alpha-1%22

@Stamatis: Sadly, I missed your letter/jira and created my own with the fix for building from the src package:
https://issues.apache.org/jira/browse/HIVE-25997
If you have time, I would like to ask you to review it.

If anyone knows of any blocker, I would like to ask them to create a jira for that too.

Thanks,
Peter

> On 2022. Mar 2., at 7:04, Sungwoo Park <c...@pl.postech.ac.kr> wrote:
>
> Hello Alessandro,
>
> For the latest commit, loading ORC tables fails (with the log message shown
> below). Let me try to find the commit that introduced this bug and create a
> JIRA ticket.
>
> --- Sungwoo
>
> 2022-03-02 05:41:56,578 ERROR [Thread-73] exec.StatsTask: Failed to run stats task
> java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:622)
>     at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.constructColumnStatsFromPackedRows(ColStatsProcessor.java:105)
>     at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.persistColumnStats(ColStatsProcessor.java:200)
>     at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.process(ColStatsProcessor.java:93)
>     at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
>     at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
>     at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:83)
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
>     at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.generateWrappedSplits(FetchOperator.java:435)
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:402)
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:306)
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560)
>     ... 7 more
>
> On Tue, 1 Mar 2022, Alessandro Solimando wrote:
>
>> Hi Sungwoo,
>> last time I tried to run a TPCDS-based benchmark I stumbled upon a similar
>> situation; in the end I found that statistics were not computed, so CBO was
>> not kicking in, and the automatic retry goes with CBO off, which was failing
>> for something like 10 queries (subqueries cannot be decorrelated, but also
>> some runtime errors).
>>
>> Making sure that (column) statistics were correctly computed fixed the
>> problem.
>>
>> Can you check if this is the case for you?
>>
>> HTH,
>> Alessandro
>>
>> On Tue, 1 Mar 2022 at 15:28, POSTECH CT <c...@pl.postech.ac.kr> wrote:
>>
>>> Hello Hive team,
>>>
>>> I wonder if anyone in the Hive team has tried the TPC-DS benchmark on
>>> the master branch recently.
>>> We occasionally run TPC-DS system tests
>>> using the master branch, and the tests don't succeed completely. Here
>>> is how our TPC-DS tests proceed.
>>>
>>> 1. Compile and run Hive on Tez (not Hive-LLAP)
>>> 2. Load ORC tables from 1TB TPC-DS raw text data, and compute statistics
>>> 3. Run 99 TPC-DS queries which were slightly modified to return
>>>    a varying number of rows (rather than 100 rows)
>>> 4. Compare the results against the previous results
>>>
>>> The previous results were obtained and cross-checked by running Hive
>>> 3.1.2 and SparkSQL 2.3/3.2, so we are fairly confident about their
>>> correctness.
>>>
>>> For the latest commit in the master branch, step 2 fails. For earlier
>>> commits (for example, commits in February 2021), step 3 fails where
>>> several queries either fail or return wrong results.
>>>
>>> We can compile and report the test results in this mailing list, but
>>> would like to know if similar results have been reproduced by the Hive
>>> team, in order to make sure that we did not make errors in our tests.
>>>
>>> If it is okay to open a JIRA ticket that only reports failures in the
>>> TPC-DS test, we could also perform git bisect to locate the commit
>>> that began to generate wrong results.
>>>
>>> --- Sungwoo Park
>>>
>>> On Tue, 1 Mar 2022, Zoltan Haindrich wrote:
>>>
>>>> Hey,
>>>>
>>>> Great to hear that we are on the same side regarding these things :)
>>>>
>>>> For around a week now - we have nightly builds for the master branch:
>>>> http://ci.hive.apache.org/job/hive-nightly/12/
>>>>
>>>> I think we have 1 blocker issue:
>>>> https://issues.apache.org/jira/browse/HIVE-25665
>>>>
>>>> I know about one more thing I would rather get fixed before we release it:
>>>> https://issues.apache.org/jira/browse/HIVE-25994
>>>> The best would be to introduce smoke tests (HIVE-22302) to ensure that
>>>> something like this will not happen in the future - but we should probably
>>>> start moving forward.
>>>>
>>>> I think we could call the first iteration of this "4.0.0-alpha-1" :)
>>>>
>>>> I've added 4.0.0-alpha-1 as a version - and added the above two tickets to it.
>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20fixVersion%20%3D%204.0.0-alpha-1
>>>>
>>>> Are there any more things you guys know which would be needed?
>>>>
>>>> cheers,
>>>> Zoltan
>>>>
>>>>
>>>> On 2/22/22 12:18 PM, Peter Vary wrote:
>>>>> I would vote for 4.0.0-alpha-1 or similar for all of the components.
>>>>>
>>>>> When we have more stable releases I would keep the 4.x.x scheme, since
>>>>> everyone is familiar with it, and I do not see a really good reason to
>>>>> change it.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>>
>>>>>> On 2022. Feb 10., at 3:34, Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>
>>>>>> +1 that would be awesome to see Hive master released after so long.
>>>>>>
>>>>>> Either 4.0 or 4.0.0-alpha-1 makes sense to me, not sure how we would pick
>>>>>> any 3.x or calendar date (which could tend to slip and be more
>>>>>> confusing?).
>>>>>>
>>>>>> Thanks in any case for getting the ball rolling.
>>>>>> Szehon
>>>>>>
>>>>>> On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich <k...@rxd.hu> wrote:
>>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>> Thank you guys for chiming in; versioning is for sure something we should
>>>>>>> get to some common ground.
>>>>>>> It's a triple problem right now; I think we have the following things:
>>>>>>> * storage-api
>>>>>>> ** we have "2.7.3-SNAPSHOT" in the repo
>>>>>>> *** https://github.com/apache/hive/blob/0d1cffffc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
>>>>>>> ** meanwhile we already have 2.8.1 released to maven central
>>>>>>> *** https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
>>>>>>> * standalone-metastore
>>>>>>> ** 4.0.0-SNAPSHOT in the repo
>>>>>>> ** last release is 3.1.2
>>>>>>> * hive
>>>>>>> ** 4.0.0-SNAPSHOT in the repo
>>>>>>> ** last release is 3.1.2
>>>>>>>
>>>>>>> Regarding the actual version number I'm not entirely sure where we should
>>>>>>> start the numbering - that's why I was referring to it as Hive-X in my
>>>>>>> first letter.
>>>>>>>
>>>>>>> I think the key point here would be to start shipping releases regularly
>>>>>>> and not the actual version number we will use - I'm kinda open to any
>>>>>>> versioning scheme which
>>>>>>> reflects that this is a newer release than 3.1.2.
>>>>>>>
>>>>>>> I could imagine the following ones:
>>>>>>> (A) start with something less expected; but keep 3 in the prefix to
>>>>>>>     reflect that this is not yet 4.0
>>>>>>>     I can imagine the following numbers:
>>>>>>>     3.900.0, 3.901.0, ...
>>>>>>>     3.9.0, 3.9.1, ...
>>>>>>> (B) start 4.0.0
>>>>>>>     4.0.0, 4.1.0, ...
>>>>>>> (C) jump to some calendar based version number like 2022.2.9
>>>>>>>     trunk based development has pros and cons...making a move like this
>>>>>>>     irreversibly pledges trunk based development; and makes release branches
>>>>>>>     hard to introduce
>>>>>>> (X) somewhat orthogonal is to (also) use some suffixes
>>>>>>>     4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
>>>>>>>     this is probably the most tempting to use - but this versioning
>>>>>>>     scheme with a non-changing MINOR and PATCH number will
>>>>>>>     also suggest that the actual software is fully compatible - and only
>>>>>>>     bugs are being fixed - which will not be true...
>>>>>>>
>>>>>>> I really like the idea to suffix these releases with alpha or beta - which
>>>>>>> will communicate our level of commitment: that these are not 100% production
>>>>>>> ready artifacts.
>>>>>>>
>>>>>>> I think we could fix HIVE-25665; and probably experiment with 4.0.0-alpha1
>>>>>>> for start...
>>>>>>>
>>>>>>>> This also means there should *not* be a branch-4 after releasing Hive 4.0
>>>>>>>> and let that diverge (and becomes the next, super-ignored branch-3),
>>>>>>> correct; no need to keep a branch we don't maintain...but in any case I
>>>>>>> think we can postpone this decision until there will be something to
>>>>>>> release... :)
>>>>>>>
>>>>>>> cheers,
>>>>>>> Zoltan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2/9/22 10:23 AM, László Bodor wrote:
>>>>>>>> Hi All!
>>>>>>>>
>>>>>>>> A purely technical question: what will the SNAPSHOT version become after
>>>>>>>> releasing Hive 4.0.0? I think this is important, as it defines and reflects
>>>>>>>> the future release plans.
>>>>>>>>
>>>>>>>> Currently, it's 4.0.0-SNAPSHOT, I guess it's since Hive 3.0 + branch-3.
>>>>>>>> Hive is an evolving and super-active project: if we want to make regular
>>>>>>>> releases, we should simply release Hive 4.0 and bump pom to 4.1.0-SNAPSHOT,
>>>>>>>> which clearly says that we can release Hive 4.1 anytime we want, without
>>>>>>>> being frustrated about "whether we included enough cool stuff to release
>>>>>>>> 5.0".
>>>>>>>>
>>>>>>>> This also means there should *not* be a branch-4 after releasing Hive 4.0
>>>>>>>> and let that diverge (and becomes the next, super-ignored branch-3), only
>>>>>>>> when we end up bringing a minor backward-incompatible thing that needs a
>>>>>>>> 4.0.x, and when it happens, we'll create *branch-4.0* on demand. For me, a
>>>>>>>> branch called *branch-4.0* doesn't imply either I can expect cool releases
>>>>>>>> in the future from there or the branch is maintained and tries to be in
>>>>>>>> sync with the *master*.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Laszlo Bodor
>>>>>>>>
>>>>>>>> Alessandro Solimando <alessandro.solima...@gmail.com> wrote (on Tue, 8 Feb 2022, 16:42):
>>>>>>>>
>>>>>>>>> Hello everyone,
>>>>>>>>> thank you for starting this discussion.
>>>>>>>>>
>>>>>>>>> I agree that releasing the master branch regularly and sufficiently often
>>>>>>>>> is welcome and vital for the health of the community.
>>>>>>>>>
>>>>>>>>> It would be great to hear from others too, especially PMC members and
>>>>>>>>> committers, but even simple contributors/followers like myself.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Alessandro
>>>>>>>>>
>>>>>>>>> On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis <zabe...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Thanks for starting the discussion Zoltan.
>>>>>>>>>>
>>>>>>>>>> I strongly believe that it is important to have regular and frequent releases,
>>>>>>>>>> otherwise people will create and maintain separate Hive forks.
>>>>>>>>>> The latter is not good for the project and the community may lose valuable
>>>>>>>>>> members because of it.
>>>>>>>>>>
>>>>>>>>>> Going forward I fully agree that there is no point bringing up strong
>>>>>>>>>> blockers for the next release. For sure there are many backward
>>>>>>>>>> incompatible changes and possibly unstable features but unless we get a
>>>>>>>>>> release out it will be difficult to determine what is broken and what needs
>>>>>>>>>> to be fixed.
>>>>>>>>>>
>>>>>>>>>> Due to the large number of changes that are going to appear in the next
>>>>>>>>>> version I would suggest using the terms Hive X-alpha, Hive X-beta for the
>>>>>>>>>> first few releases. This will make it clear to the end users that they need
>>>>>>>>>> to be careful when upgrading from an older version and it will give us a
>>>>>>>>>> bit more time and freedom to treat issues that the users will likely
>>>>>>>>>> discover.
>>>>>>>>>>
>>>>>>>>>> The only real blocker that we may want to treat is HIVE-25665 [1] but we
>>>>>>>>>> can continue the discussion under that ticket and re-evaluate if necessary.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stamatis
>>>>>>>>>>
>>>>>>>>>> [1] https://issues.apache.org/jira/browse/HIVE-25665
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich <k...@rxd.hu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey All,
>>>>>>>>>>>
>>>>>>>>>>> We didn't make a release for a long time now; (3.1.2 was released on 26
>>>>>>>>>>> August 2019) - and I think because we didn't make that many branch-3
>>>>>>>>>>> releases; not too many fixes
>>>>>>>>>>> were ported there - which made that release branch kinda erode away.
>>>>>>>>>>>
>>>>>>>>>>> We have a lot of new features/changes in the current master.
>>>>>>>>>>> I think instead of aiming for big feature-packed releases we should aim
>>>>>>>>>>> for making a regular release every few months - we should make regular
>>>>>>>>>>> releases which people could
>>>>>>>>>>> install and use.
>>>>>>>>>>> After all releasing Hive after more than 2 years would be a big step forward
>>>>>>>>>>> in itself - we have so many improvements that I can't even count...
>>>>>>>>>>>
>>>>>>>>>>> But I may not know every aspect of the project / state of some internal
>>>>>>>>>>> features - so I would like to ask you:
>>>>>>>>>>> What would be the bare minimum requirements before we could release the
>>>>>>>>>>> current master as Hive X?
>>>>>>>>>>>
>>>>>>>>>>> There are many nice-to-have-s like:
>>>>>>>>>>> * hadoop upgrade
>>>>>>>>>>> * jdk11
>>>>>>>>>>> * remove HoS or MR
>>>>>>>>>>> * ?
>>>>>>>>>>> but I don't think these are blockers...we can make any of these in the
>>>>>>>>>>> next release if we start making them...
>>>>>>>>>>>
>>>>>>>>>>> cheers,
>>>>>>>>>>> Zoltan
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>