Good idea Zoltan, I have joined the channel. I would like to keep the scope reasonably small, so I agree with focusing on 4.0.0-alpha-1.
> On 2022. Mar 2., at 11:01, Zoltan Haindrich <k...@rxd.hu> wrote:
>
> Hey,
>
> regarding 4.0.0 / 4.0.0-alpha-1 target/fix versions in the jira:
> * I think we should change all already resolved tickets with fix version
>   4.0.0 to have fix version 4.0.0-alpha-1
> ** this could be postponed until we are actually releasing the thing, as I
>    think everyone committing to the master is entering 4.0.0 as fix version
>    without much afterthought...this could probably change after we get the
>    first release out.
> * regarding the existing tickets with fix version/target version 4.0.0 -
>   I think that would be a bit too much (>200 tickets)
> ** some numbers:
> *** 239 tickets open now
> *** 224 were not updated in the last 90 days
> *** 216 were not changed in the last 180 days
> *** 178 were not updated in the last 360 days
> ** as a matter of fact I think many of these tickets shouldn't even have a
>    target or fix version - and most of them should be unassigned...I don't
>    want to get lost in this right now...I think for now we should keep the
>    scope small and only care about 4.0.0-alpha-1 tickets
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20hive%20and%20resolutiondate%20%20is%20empty%20and%20(fixVersion%20%20in%20(%274.0.0%27)%20or%20cf%5B12310320%5D%20%20in%20(%274.0.0%27))
>
> I think for faster communication regarding these things we could also utilize
> the #hive channel on the ASF slack - what do you guys think?
>
> cheers,
> Zoltan
>
> On 3/2/22 9:51 AM, Stamatis Zampetakis wrote:
>> Agree with Peter, creating JIRAs is the way to go.
>> Putting the appropriate priority (e.g., BLOCKER) and version (4.0.0 or
>> 4.0.0-alpha-1) when creating the JIRA should be enough to keep us on track.
>> I am mentioning both 4.0.0 and 4.0.0-alpha-1 because eventually I think we
>> are gonna move everything with target 4.0.0 to 4.0.0-alpha-1.
>> On Wed, Mar 2, 2022 at 9:37 AM Peter Vary <pv...@cloudera.com.invalid> >> wrote: >>> Hi Team, >>> >>> Could we create tickets for the issues? >>> I think it would be good to collect the issues/potential blockers in the >>> jira instead of having a complicated mail thread. >>> >>> If we set the target version to 4.0.0-alpha-1, then we can easily use the >>> following filter to see the status of the tasks: >>> >>> https://issues.apache.org/jira/issues/?jql=project%3D%22HIVE%22%20AND%20%22Target%20Version%2Fs%22%3D%224.0.0-alpha-1%22 >>> < >>> https://issues.apache.org/jira/issues/?jql=project=%22HIVE%22%20AND%20%22Target%20Version/s%22=%224.0.0-alpha-1%22 >>>> >>> >>> @Stamatis: Sadly I have missed your letter/jira and created my own with >>> the fix for building from the src package: >>> https://issues.apache.org/jira/browse/HIVE-25997 < >>> https://issues.apache.org/jira/browse/HIVE-25997> >>> If you have time, I would like to ask you to review. >>> >>> If anyone knows of any blocker I would like to ask them to create a jira >>> for that too. >>> >>> Thanks, >>> Peter >>> >>> >>>> On 2022. Mar 2., at 7:04, Sungwoo Park <c...@pl.postech.ac.kr> wrote: >>>> >>>> Hello Alessandro, >>>> >>>> For the latest commit, loading ORC tables fails (with the log message >>> shown below). Let me try to find a commit that introduces this bug and >>> create a JIRA ticket. 
>>>> >>>> --- Sungwoo >>>> >>>> 2022-03-02 05:41:56,578 ERROR [Thread-73] exec.StatsTask: Failed to run >>> stats task >>>> java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: >>> Input path does not exist: >>> hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001 >>>> at >>> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:622) >>>> at >>> org.apache.hadoop.hive.ql.stats.ColStatsProcessor.constructColumnStatsFromPackedRows(ColStatsProcessor.java:105) >>>> at >>> org.apache.hadoop.hive.ql.stats.ColStatsProcessor.persistColumnStats(ColStatsProcessor.java:200) >>>> at >>> org.apache.hadoop.hive.ql.stats.ColStatsProcessor.process(ColStatsProcessor.java:93) >>>> at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107) >>>> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212) >>>> at >>> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) >>>> at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:83) >>>> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path >>> does not exist: >>> hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001 >>>> at >>> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294) >>>> at >>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236) >>>> at >>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45) >>>> at >>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) >>>> at >>> org.apache.hadoop.hive.ql.exec.FetchOperator.generateWrappedSplits(FetchOperator.java:435) >>>> at >>> 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:402) >>>> at >>> org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:306) >>>> at >>> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560) >>>> ... 7 more >>>> >>>> On Tue, 1 Mar 2022, Alessandro Solimando wrote: >>>> >>>>> Hi Sungwoo, >>>>> last time I tried to run TPCDS-based benchmark I stumbled upon a similar >>>>> situation, finally I found that statistics were not computed, so CBO was >>>>> not kicking in, and the automatic retry goes with CBO off which was >>> failing >>>>> for something like 10 queries (subqueries cannot be decorrelated, but >>> also >>>>> some runtime errors). >>>>> >>>>> Making sure that (column) statistics were correctly computed fixed the >>>>> problem. >>>>> >>>>> Can you check if this is the case for you? >>>>> >>>>> HTH, >>>>> Alessandro >>>>> >>>>> On Tue, 1 Mar 2022 at 15:28, POSTECH CT <c...@pl.postech.ac.kr> wrote: >>>>> >>>>>> Hello Hive team, >>>>>> >>>>>> I wonder if anyone in the Hive team has tried the TPC-DS benchmark on >>>>>> the master branch recently. We occasionally run TPC-DS system tests >>>>>> using the master branch, and the tests don't succeed completely. Here >>>>>> is how our TPC-DS tests proceed. >>>>>> >>>>>> 1. Compile and run Hive on Tez (not Hive-LLAP) >>>>>> 2. Load ORC tables from 1TB TPC-DS raw text data, and compute >>> statistics >>>>>> 3. Run 99 TPC-DS queries which were slightly modified to return >>>>>> varying number of rows (rather than 100 rows) >>>>>> 4. Compare the results against the previous results >>>>>> >>>>>> The previous results were obtained and cross-checked by running Hive >>>>>> 3.1.2 and SparkSQL 2.3/3.2, so we are faily confident about their >>>>>> correctness. >>>>>> >>>>>> For the latest commit in the master branch, step 2 fails. 
For earlier >>>>>> commits (for example, commits in February 2021), step 3 fails where >>>>>> several queries either fail or return wrong results. >>>>>> >>>>>> We can compile and report the test results in this mailing list, but >>>>>> would like to know if similar results have been reproduced by the Hive >>>>>> team, in order to make sure that we did not make errors in our tests. >>>>>> >>>>>> If it is okay to open a JIRA ticket that only reports failures in the >>>>>> TPC-DS test, we could also perform git bi-sect to locate the commit >>>>>> that begin to generate wrong results. >>>>>> >>>>>> --- Sungwoo Park >>>>>> >>>>>> On Tue, 1 Mar 2022, Zoltan Haindrich wrote: >>>>>> >>>>>>> Hey, >>>>>>> >>>>>>> Great to hear that we are on the same side regarding these things :) >>>>>>> >>>>>>> For around a week now - we have nightly builds for the master branch: >>>>>>> http://ci.hive.apache.org/job/hive-nightly/12/ >>>>>>> >>>>>>> I think we have 1 blocker issue: >>>>>>> https://issues.apache.org/jira/browse/HIVE-25665 >>>>>>> >>>>>>> I know about one more thing I would rather get fixed before we release >>>>>> it: >>>>>>> https://issues.apache.org/jira/browse/HIVE-25994 >>>>>>> The best would be to introduce smoke tests (HIVE-22302) to ensure that >>>>>>> something like this will not happen in the future - but we should >>>>>> probably >>>>>>> start moving forward. >>>>>>> >>>>>>> I think we could call the first iteration of this as "4.0.0-alpha-1" >>> :) >>>>>>> >>>>>>> I've added 4.0.0-alpha-1 as a version - and added the above two ticket >>>>>> to it. >>>>>>> >>>>>> >>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20fixVersion%20%3D%204.0.0-alpha-1 >>>>>>> >>>>>>> Are there any more things you guys know which would be needed? >>>>>>> >>>>>>> cheers, >>>>>>> Zoltan >>>>>>> >>>>>>> >>>>>>> On 2/22/22 12:18 PM, Peter Vary wrote: >>>>>>>> I would vote for 4.0.0-alpha-1 or similar for all of the components. 
>>>>>>>> >>>>>>>> When we have more stable releases I would keep the 4.x.x schema, >>> since >>>>>>>> everyone is familiar with it, and I do not see a really good reason >>> to >>>>>>>> change it. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Peter >>>>>>>> >>>>>>>> >>>>>>>>> On 2022. Feb 10., at 3:34, Szehon Ho <szehon.apa...@gmail.com> >>> wrote: >>>>>>>>> >>>>>>>>> +1 that would be awesome to see Hive master released after so long. >>>>>>>>> >>>>>>>>> Either 4.0 or 4.0.0-alpha-1 makes sense to me, not sure how we would >>>>>> pick >>>>>>>>> any 3.x or calendar date (which could tend to slip and be more >>>>>>>>> confusing?). >>>>>>>>> >>>>>>>>> Thanks in any case to get the ball rolling. >>>>>>>>> Szehon >>>>>>>>> >>>>>>>>> On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich <k...@rxd.hu> >>> wrote: >>>>>>>>> >>>>>>>>>> Hey, >>>>>>>>>> >>>>>>>>>> Thank you guys for chiming in; versioning is for sure something we >>>>>> should >>>>>>>>>> get to some common ground. >>>>>>>>>> Its a triple problem right now; I think we have the following >>> things: >>>>>>>>>> * storage-api >>>>>>>>>> ** we have "2.7.3-SNAPSHOT" in the repo >>>>>>>>>> *** >>>>>>>>>> >>>>>> >>> https://github.com/apache/hive/blob/0d1cffffc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27 >>>>>>>>>> ** meanwhile we already have 2.8.1 released to maven central >>>>>>>>>> *** >>>>>> https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api >>>>>>>>>> * standalone-metastore >>>>>>>>>> ** 4.0.0-SNAPSHOT in the repo >>>>>>>>>> ** last release is 3.1.2 >>>>>>>>>> * hive >>>>>>>>>> ** 4.0.0-SNAPSHOT in the repo >>>>>>>>>> ** last release is 3.1.2 >>>>>>>>>> >>>>>>>>>> Regarding the actual version number I'm not entirely sure where we >>>>>> should >>>>>>>>>> start the numbering - that's why I was referring to it as Hive-X >>> in my >>>>>>>>>> first letter. 
>>>>>>>>>> >>>>>>>>>> I think the key point here would be to start shipping releases >>>>>> regularily >>>>>>>>>> and not the actual version number we will use - I'll kinda open to >>> any >>>>>>>>>> versioning scheme which >>>>>>>>>> reflects that this is a newer release than 3.1.2. >>>>>>>>>> >>>>>>>>>> I could imagine the following ones: >>>>>>>>>> (A) start with something less expected; but keep 3 in the prefix to >>>>>>>>>> reflect that this is not yet 4.0 >>>>>>>>>> I can imagine the following numbers: >>>>>>>>>> 3.900.0, 3.901.0, ... >>>>>>>>>> 3.9.0, 3.9.1, ... >>>>>>>>>> (B) start 4.0.0 >>>>>>>>>> 4.0.0, 4.1.0, ... >>>>>>>>>> (C) jump to some calendar based version number like 2022.2.9 >>>>>>>>>> trunk based development has pros and cons...making a move like >>>>>> this >>>>>>>>>> irreversibly pledges trunk based development; and makes release >>>>>> branches >>>>>>>>>> hard to introduce >>>>>>>>>> (X) somewhat orthogonal is to (also) use some suffixes >>>>>>>>>> 4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1 >>>>>>>>>> this is probably the most tempting to use - but this versioning >>>>>>>>>> schema with a non-changing MINOR and PATCH number will >>>>>>>>>> also suggest that the actual software is fully compatible - and >>>>>> only >>>>>>>>>> bugs are being fixed - which will not be true... >>>>>>>>>> >>>>>>>>>> I really like the idea to suffix these releases with alpha or beta >>> - >>>>>>>>>> which >>>>>>>>>> will communicate our level commitment that these are not 100% >>>>>> production >>>>>>>>>> ready artifacts. >>>>>>>>>> >>>>>>>>>> I think we could fix HIVE-25665; and probably experiment with >>>>>>>>>> 4.0.0-alpha1 >>>>>>>>>> for start... 
>>>>>>>>>> >>>>>>>>>>> This also means there should *not* be a branch-4 after releasing >>> Hive >>>>>>>>>> 4.0 >>>>>>>>>>> and let that diverge (and becomes the next, super-ignored >>> branch-3), >>>>>>>>>> correct; no need to keep a branch we don't maintain...but in any >>> case >>>>>> I >>>>>>>>>> think we can postpone this decision until there will be something >>> to >>>>>>>>>> release... :) >>>>>>>>>> >>>>>>>>>> cheers, >>>>>>>>>> Zoltan >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 2/9/22 10:23 AM, L?szl? Bodor wrote: >>>>>>>>>>> Hi All! >>>>>>>>>>> >>>>>>>>>>> A purely technical question: what will the SNAPSHOT version become >>>>>> after >>>>>>>>>>> releasing Hive 4.0.0? I think this is important, as it defines and >>>>>>>>>> reflects >>>>>>>>>>> the future release plans. >>>>>>>>>>> >>>>>>>>>>> Currently, it's 4.0.0-SNAPSHOT, I guess it's since Hive 3.0 + >>>>>> branch-3. >>>>>>>>>>> Hive is an evolving and super-active project: if we want to make >>>>>> regular >>>>>>>>>>> releases, we should simply release Hive 4.0 and bump pom to >>>>>>>>>> 4.1.0-SNAPSHOT, >>>>>>>>>>> which clearly says that we can release Hive 4.1 anytime we want, >>>>>> without >>>>>>>>>>> being frustrated about "whether we included enough cool stuff to >>>>>> release >>>>>>>>>>> 5.0". >>>>>>>>>>> >>>>>>>>>>> This also means there should *not* be a branch-4 after releasing >>>>>> Hive >>>>>>>>>>> 4.0 >>>>>>>>>>> and let that diverge (and becomes the next, super-ignored >>> branch-3), >>>>>>>>>>> only >>>>>>>>>>> when we end up bringing a minor backward-incompatible thing that >>>>>> needs a >>>>>>>>>>> 4.0.x, and when it happens, we'll create *branch-4.0 *on demand. >>> For >>>>>> me, >>>>>>>>>> a >>>>>>>>>>> branch called *branch-4.0* doesn't imply either I can expect cool >>>>>>>>>> releases >>>>>>>>>>> in the future from there or the branch is maintained and tries to >>> be >>>>>> in >>>>>>>>>>> sync with the *master*. 
>>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Laszlo Bodor >>>>>>>>>>> >>>>>>>>>>> Alessandro Solimando <alessandro.solima...@gmail.com> ezt ?rta >>>>>> (id?pont: >>>>>>>>>>> 2022. febr. 8., K, 16:42): >>>>>>>>>>> >>>>>>>>>>>> Hello everyone, >>>>>>>>>>>> thank you for starting this discussion. >>>>>>>>>>>> >>>>>>>>>>>> I agree that releasing the master branch regularly and >>> sufficiently >>>>>>>>>> often >>>>>>>>>>>> is welcome and vital for the health of the community. >>>>>>>>>>>> >>>>>>>>>>>> It would be great to hear from others too, especially PMC members >>>>>> and >>>>>>>>>>>> committers, but even simple contributors/followers as myself. >>>>>>>>>>>> >>>>>>>>>>>> Best regards, >>>>>>>>>>>> Alessandro >>>>>>>>>>>> >>>>>>>>>>>> On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis < >>> zabe...@gmail.com >>>>>>> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hello, >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks for starting the discussion Zoltan. >>>>>>>>>>>>> >>>>>>>>>>>>> I strongly believe that it is important to have regular and >>> often >>>>>>>>>>>> releases >>>>>>>>>>>>> otherwise people will create and maintain separate Hive forks. >>>>>>>>>>>>> The latter is not good for the project and the community may >>> lose >>>>>>>>>>>> valuable >>>>>>>>>>>>> members because of it. >>>>>>>>>>>>> >>>>>>>>>>>>> Going forward I fully agree that there is no point bringing up >>>>>> strong >>>>>>>>>>>>> blockers for the next release. For sure there are many backward >>>>>>>>>>>>> incompatible changes and possibly unstable features but unless >>> we >>>>>> get >>>>>>>>>>>>> a >>>>>>>>>>>>> release out it will be difficult to determine what is broken and >>>>>> what >>>>>>>>>>>> needs >>>>>>>>>>>>> to be fixed. >>>>>>>>>>>>> >>>>>>>>>>>>> Due to the big number of changes that are going to appear in the >>>>>> next >>>>>>>>>>>>> version I would suggest using the terms Hive X-alpha, Hive >>> X-beta >>>>>> for >>>>>>>>>> the >>>>>>>>>>>>> first few releases. 
This will make it clear to the end users >>> that >>>>>> they >>>>>>>>>>>> need >>>>>>>>>>>>> to be careful when upgrading from an older version and it will >>>>>> give us >>>>>>>>>> a >>>>>>>>>>>>> bit more time and freedom to treat issues that the users will >>>>>> likely >>>>>>>>>>>>> discover. >>>>>>>>>>>>> >>>>>>>>>>>>> The only real blocker that we may want to treat is HIVE-25665 >>> [1] >>>>>> but >>>>>>>>>> we >>>>>>>>>>>>> can continue the discussion under that ticket and re-evaluate if >>>>>>>>>>>> necessary, >>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>> Stamatis >>>>>>>>>>>>> >>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/HIVE-25665 >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich <k...@rxd.hu> >>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hey All, >>>>>>>>>>>>>> >>>>>>>>>>>>>> We didn't made a release for a long time now; (3.1.2 was >>> released >>>>>> on >>>>>>>>>> 26 >>>>>>>>>>>>>> August 2019) - and I think because we didn't made that many >>>>>> branch-3 >>>>>>>>>>>>>> releases; not too many fixes >>>>>>>>>>>>>> were ported there - which made that release branch kinda erode >>>>>> away. >>>>>>>>>>>>>> >>>>>>>>>>>>>> We have a lot of new features/changes in the current master. >>>>>>>>>>>>>> I think instead of aiming for big feature-packed releases we >>>>>> should >>>>>>>>>> aim >>>>>>>>>>>>>> for making a regular release every few months - we should make >>>>>>>>>>>>>> regular >>>>>>>>>>>>>> releases which people could >>>>>>>>>>>>>> install and use. >>>>>>>>>>>>>> After all releasing Hive after more than 2 years would be big >>> step >>>>>>>>>>>>> forward >>>>>>>>>>>>>> in itself alone - we have so many improvements that I can't >>> even >>>>>>>>>>>> count... 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But I may not know every aspect of the project or the state of some
>>>>>>>>>>>>>> internal features - so I would like to ask you:
>>>>>>>>>>>>>> What would be the bare minimum requirements before we could release
>>>>>>>>>>>>>> the current master as Hive X?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are many nice-to-haves like:
>>>>>>>>>>>>>> * hadoop upgrade
>>>>>>>>>>>>>> * jdk11
>>>>>>>>>>>>>> * remove HoS or MR
>>>>>>>>>>>>>> * ?
>>>>>>>>>>>>>> but I don't think these are blockers...we can make any of these in
>>>>>>>>>>>>>> the next release if we start making them...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cheers,
>>>>>>>>>>>>>> Zoltan
>>>>>>>>>>>>>>
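
For anyone who wants to rebuild the Jira filter links quoted in this thread, here is a minimal Python sketch that percent-encodes a plain JQL string into a search URL. The function name `jira_filter_url` and the constant `JIRA_SEARCH_BASE` are made up for this illustration; the base URL and the JQL text are taken from the links above. It only builds the URL and does not call Jira.

```python
from urllib.parse import quote

# Base of the issue-search links quoted in the thread.
JIRA_SEARCH_BASE = "https://issues.apache.org/jira/issues/?jql="


def jira_filter_url(jql: str) -> str:
    """Return a Jira search URL for the given JQL string.

    safe="" makes quote() escape "/" as well, so the field name
    "Target Version/s" is encoded the same way as in the thread.
    """
    return JIRA_SEARCH_BASE + quote(jql, safe="")


# The filter suggested in the thread: everything targeted at 4.0.0-alpha-1.
alpha_1_jql = 'project="HIVE" AND "Target Version/s"="4.0.0-alpha-1"'

if __name__ == "__main__":
    print(jira_filter_url(alpha_1_jql))
```

Swapping the JQL string (for example to filter on fixVersion instead of the target version) gives the other links in the same way.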