Agree with Peter: creating JIRAs is the way to go. Setting the appropriate priority (e.g., BLOCKER) and version (4.0.0 or 4.0.0-alpha-1) when creating the JIRA should be enough to keep us on track. I am mentioning both 4.0.0 and 4.0.0-alpha-1 because I think we will eventually move everything with target 4.0.0 to 4.0.0-alpha-1.
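For anyone who wants to build the same kind of filter link for another version, here is a minimal sketch in Python. It only URL-encodes the JQL that Peter's link quoted below already uses; the function name and the constant are made up for illustration and are not part of any Hive or JIRA tooling.

```python
from urllib.parse import quote

# Base of the JIRA issue search page, taken from the link in the quoted thread.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/?jql="


def target_version_filter_url(project: str, target_version: str) -> str:
    """Return a JIRA search URL for all issues of `project` with the given target version.

    Hypothetical helper: it rebuilds the JQL seen in the thread and
    percent-encodes every reserved character, including '/'.
    """
    jql = f'project="{project}" AND "Target Version/s"="{target_version}"'
    return JIRA_SEARCH + quote(jql, safe="")


url = target_version_filter_url("HIVE", "4.0.0-alpha-1")
print(url)
```

Calling it with "4.0.0" instead gives the equivalent link for tickets that still carry the plain 4.0.0 target.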
On Wed, Mar 2, 2022 at 9:37 AM Peter Vary <pv...@cloudera.com.invalid> wrote:

> Hi Team,
>
> Could we create tickets for the issues?
> I think it would be good to collect the issues/potential blockers in the
> jira instead of having a complicated mail thread.
>
> If we set the target version to 4.0.0-alpha-1, then we can easily use the
> following filter to see the status of the tasks:
>
> https://issues.apache.org/jira/issues/?jql=project%3D%22HIVE%22%20AND%20%22Target%20Version%2Fs%22%3D%224.0.0-alpha-1%22
>
> @Stamatis: Sadly I have missed your letter/jira and created my own with
> the fix for building from the src package:
> https://issues.apache.org/jira/browse/HIVE-25997
> If you have time, I would like to ask you to review.
>
> If anyone knows of any blocker I would like to ask them to create a jira
> for that too.
>
> Thanks,
> Peter
>
> > On 2022. Mar 2., at 7:04, Sungwoo Park <c...@pl.postech.ac.kr> wrote:
> >
> > Hello Alessandro,
> >
> > For the latest commit, loading ORC tables fails (with the log message
> > shown below). Let me try to find a commit that introduces this bug and
> > create a JIRA ticket.
> >
> > --- Sungwoo
> >
> > 2022-03-02 05:41:56,578 ERROR [Thread-73] exec.StatsTask: Failed to run stats task
> > java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
> >     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:622)
> >     at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.constructColumnStatsFromPackedRows(ColStatsProcessor.java:105)
> >     at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.persistColumnStats(ColStatsProcessor.java:200)
> >     at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.process(ColStatsProcessor.java:93)
> >     at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
> >     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
> >     at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
> >     at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:83)
> > Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
> >     at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
> >     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
> >     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
> >     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
> >     at org.apache.hadoop.hive.ql.exec.FetchOperator.generateWrappedSplits(FetchOperator.java:435)
> >     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:402)
> >     at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:306)
> >     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560)
> >     ... 7 more
> >
> > On Tue, 1 Mar 2022, Alessandro Solimando wrote:
> >
> >> Hi Sungwoo,
> >> last time I tried to run a TPCDS-based benchmark I stumbled upon a similar
> >> situation. In the end I found that statistics were not computed, so CBO was
> >> not kicking in, and the automatic retry goes with CBO off, which was failing
> >> for something like 10 queries (subqueries cannot be decorrelated, but also
> >> some runtime errors).
> >>
> >> Making sure that (column) statistics were correctly computed fixed the
> >> problem.
> >>
> >> Can you check if this is the case for you?
> >>
> >> HTH,
> >> Alessandro
> >>
> >> On Tue, 1 Mar 2022 at 15:28, POSTECH CT <c...@pl.postech.ac.kr> wrote:
> >>
> >>> Hello Hive team,
> >>>
> >>> I wonder if anyone in the Hive team has tried the TPC-DS benchmark on
> >>> the master branch recently. We occasionally run TPC-DS system tests
> >>> using the master branch, and the tests don't succeed completely. Here
> >>> is how our TPC-DS tests proceed.
> >>>
> >>> 1. Compile and run Hive on Tez (not Hive-LLAP)
> >>> 2. Load ORC tables from 1TB TPC-DS raw text data, and compute statistics
> >>> 3. Run 99 TPC-DS queries which were slightly modified to return
> >>>    varying number of rows (rather than 100 rows)
> >>> 4. Compare the results against the previous results
> >>>
> >>> The previous results were obtained and cross-checked by running Hive
> >>> 3.1.2 and SparkSQL 2.3/3.2, so we are fairly confident about their
> >>> correctness.
> >>>
> >>> For the latest commit in the master branch, step 2 fails. For earlier
> >>> commits (for example, commits in February 2021), step 3 fails where
> >>> several queries either fail or return wrong results.
> >>>
> >>> We can compile and report the test results in this mailing list, but
> >>> would like to know if similar results have been reproduced by the Hive
> >>> team, in order to make sure that we did not make errors in our tests.
> >>>
> >>> If it is okay to open a JIRA ticket that only reports failures in the
> >>> TPC-DS test, we could also perform git bisect to locate the commit
> >>> that began to generate wrong results.
> >>>
> >>> --- Sungwoo Park
> >>>
> >>> On Tue, 1 Mar 2022, Zoltan Haindrich wrote:
> >>>
> >>>> Hey,
> >>>>
> >>>> Great to hear that we are on the same side regarding these things :)
> >>>>
> >>>> For around a week now - we have nightly builds for the master branch:
> >>>> http://ci.hive.apache.org/job/hive-nightly/12/
> >>>>
> >>>> I think we have 1 blocker issue:
> >>>> https://issues.apache.org/jira/browse/HIVE-25665
> >>>>
> >>>> I know about one more thing I would rather get fixed before we release it:
> >>>> https://issues.apache.org/jira/browse/HIVE-25994
> >>>> The best would be to introduce smoke tests (HIVE-22302) to ensure that
> >>>> something like this will not happen in the future - but we should probably
> >>>> start moving forward.
> >>>>
> >>>> I think we could call the first iteration of this "4.0.0-alpha-1" :)
> >>>>
> >>>> I've added 4.0.0-alpha-1 as a version - and added the above two tickets to it.
> >>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20fixVersion%20%3D%204.0.0-alpha-1
> >>>>
> >>>> Are there any more things you guys know which would be needed?
> >>>>
> >>>> cheers,
> >>>> Zoltan
> >>>>
> >>>>
> >>>> On 2/22/22 12:18 PM, Peter Vary wrote:
> >>>>> I would vote for 4.0.0-alpha-1 or similar for all of the components.
> >>>>>
> >>>>> When we have more stable releases I would keep the 4.x.x schema, since
> >>>>> everyone is familiar with it, and I do not see a really good reason to
> >>>>> change it.
> >>>>>
> >>>>> Thanks,
> >>>>> Peter
> >>>>>
> >>>>>
> >>>>>> On 2022. Feb 10., at 3:34, Szehon Ho <szehon.apa...@gmail.com> wrote:
> >>>>>>
> >>>>>> +1 that would be awesome to see Hive master released after so long.
> >>>>>>
> >>>>>> Either 4.0 or 4.0.0-alpha-1 makes sense to me, not sure how we would pick
> >>>>>> any 3.x or calendar date (which could tend to slip and be more
> >>>>>> confusing?).
> >>>>>>
> >>>>>> Thanks in any case to get the ball rolling.
> >>>>>> Szehon
> >>>>>>
> >>>>>> On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich <k...@rxd.hu> wrote:
> >>>>>>
> >>>>>>> Hey,
> >>>>>>>
> >>>>>>> Thank you guys for chiming in; versioning is for sure something we should
> >>>>>>> get to some common ground.
> >>>>>>> It's a triple problem right now; I think we have the following things:
> >>>>>>> * storage-api
> >>>>>>> ** we have "2.7.3-SNAPSHOT" in the repo
> >>>>>>> *** https://github.com/apache/hive/blob/0d1cffffc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
> >>>>>>> ** meanwhile we already have 2.8.1 released to maven central
> >>>>>>> *** https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
> >>>>>>> * standalone-metastore
> >>>>>>> ** 4.0.0-SNAPSHOT in the repo
> >>>>>>> ** last release is 3.1.2
> >>>>>>> * hive
> >>>>>>> ** 4.0.0-SNAPSHOT in the repo
> >>>>>>> ** last release is 3.1.2
> >>>>>>>
> >>>>>>> Regarding the actual version number I'm not entirely sure where we should
> >>>>>>> start the numbering - that's why I was referring to it as Hive-X in my
> >>>>>>> first letter.
> >>>>>>>
> >>>>>>> I think the key point here would be to start shipping releases regularly
> >>>>>>> and not the actual version number we will use - I'm kinda open to any
> >>>>>>> versioning scheme which reflects that this is a newer release than 3.1.2.
> >>>>>>>
> >>>>>>> I could imagine the following ones:
> >>>>>>> (A) start with something less expected; but keep 3 in the prefix to
> >>>>>>>     reflect that this is not yet 4.0
> >>>>>>>     I can imagine the following numbers:
> >>>>>>>     3.900.0, 3.901.0, ...
> >>>>>>>     3.9.0, 3.9.1, ...
> >>>>>>> (B) start 4.0.0
> >>>>>>>     4.0.0, 4.1.0, ...
> >>>>>>> (C) jump to some calendar based version number like 2022.2.9
> >>>>>>>     trunk based development has pros and cons...making a move like this
> >>>>>>>     irreversibly pledges trunk based development; and makes release branches
> >>>>>>>     hard to introduce
> >>>>>>> (X) somewhat orthogonal is to (also) use some suffixes
> >>>>>>>     4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
> >>>>>>>     this is probably the most tempting to use - but this versioning
> >>>>>>>     schema with a non-changing MINOR and PATCH number will
> >>>>>>>     also suggest that the actual software is fully compatible - and only
> >>>>>>>     bugs are being fixed - which will not be true...
> >>>>>>>
> >>>>>>> I really like the idea to suffix these releases with alpha or beta -
> >>>>>>> which will communicate our level of commitment: that these are not 100%
> >>>>>>> production ready artifacts.
> >>>>>>>
> >>>>>>> I think we could fix HIVE-25665; and probably experiment with
> >>>>>>> 4.0.0-alpha1 for start...
> >>>>>>>
> >>>>>>>> This also means there should *not* be a branch-4 after releasing Hive 4.0
> >>>>>>>> and let that diverge (and becomes the next, super-ignored branch-3),
> >>>>>>> correct; no need to keep a branch we don't maintain...but in any case I
> >>>>>>> think we can postpone this decision until there will be something to
> >>>>>>> release... :)
> >>>>>>>
> >>>>>>> cheers,
> >>>>>>> Zoltan
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2/9/22 10:23 AM, László Bodor wrote:
> >>>>>>>> Hi All!
> >>>>>>>>
> >>>>>>>> A purely technical question: what will the SNAPSHOT version become after
> >>>>>>>> releasing Hive 4.0.0? I think this is important, as it defines and reflects
> >>>>>>>> the future release plans.
> >>>>>>>>
> >>>>>>>> Currently, it's 4.0.0-SNAPSHOT, I guess it's since Hive 3.0 + branch-3.
> >>>>>>>> Hive is an evolving and super-active project: if we want to make regular
> >>>>>>>> releases, we should simply release Hive 4.0 and bump pom to 4.1.0-SNAPSHOT,
> >>>>>>>> which clearly says that we can release Hive 4.1 anytime we want, without
> >>>>>>>> being frustrated about "whether we included enough cool stuff to release
> >>>>>>>> 5.0".
> >>>>>>>>
> >>>>>>>> This also means there should *not* be a branch-4 after releasing Hive 4.0
> >>>>>>>> and let that diverge (and becomes the next, super-ignored branch-3),
> >>>>>>>> only when we end up bringing a minor backward-incompatible thing that
> >>>>>>>> needs a 4.0.x, and when it happens, we'll create *branch-4.0* on demand.
> >>>>>>>> For me, a branch called *branch-4.0* doesn't imply either I can expect cool
> >>>>>>>> releases in the future from there or the branch is maintained and tries to
> >>>>>>>> be in sync with the *master*.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Laszlo Bodor
> >>>>>>>>
> >>>>>>>> Alessandro Solimando <alessandro.solima...@gmail.com> wrote (on
> >>>>>>>> Tue, Feb 8, 2022, 16:42):
> >>>>>>>>
> >>>>>>>>> Hello everyone,
> >>>>>>>>> thank you for starting this discussion.
> >>>>>>>>>
> >>>>>>>>> I agree that releasing the master branch regularly and sufficiently often
> >>>>>>>>> is welcome and vital for the health of the community.
> >>>>>>>>>
> >>>>>>>>> It would be great to hear from others too, especially PMC members and
> >>>>>>>>> committers, but even simple contributors/followers like myself.
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>> Alessandro
> >>>>>>>>>
> >>>>>>>>> On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis <zabe...@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for starting the discussion Zoltan.
> >>>>>>>>>>
> >>>>>>>>>> I strongly believe that it is important to have regular and frequent
> >>>>>>>>>> releases, otherwise people will create and maintain separate Hive forks.
> >>>>>>>>>> The latter is not good for the project and the community may lose valuable
> >>>>>>>>>> members because of it.
> >>>>>>>>>>
> >>>>>>>>>> Going forward I fully agree that there is no point bringing up strong
> >>>>>>>>>> blockers for the next release. For sure there are many backward
> >>>>>>>>>> incompatible changes and possibly unstable features but unless we get
> >>>>>>>>>> a release out it will be difficult to determine what is broken and what
> >>>>>>>>>> needs to be fixed.
> >>>>>>>>>>
> >>>>>>>>>> Due to the big number of changes that are going to appear in the next
> >>>>>>>>>> version I would suggest using the terms Hive X-alpha, Hive X-beta for the
> >>>>>>>>>> first few releases. This will make it clear to the end users that they
> >>>>>>>>>> need to be careful when upgrading from an older version and it will give
> >>>>>>>>>> us a bit more time and freedom to treat issues that the users will likely
> >>>>>>>>>> discover.
> >>>>>>>>>>
> >>>>>>>>>> The only real blocker that we may want to treat is HIVE-25665 [1] but we
> >>>>>>>>>> can continue the discussion under that ticket and re-evaluate if
> >>>>>>>>>> necessary.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Stamatis
> >>>>>>>>>>
> >>>>>>>>>> [1] https://issues.apache.org/jira/browse/HIVE-25665
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich <k...@rxd.hu> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey All,
> >>>>>>>>>>>
> >>>>>>>>>>> We haven't made a release for a long time now (3.1.2 was released on
> >>>>>>>>>>> 26 August 2019) - and I think because we didn't make that many branch-3
> >>>>>>>>>>> releases, not too many fixes were ported there - which made that release
> >>>>>>>>>>> branch kinda erode away.
> >>>>>>>>>>>
> >>>>>>>>>>> We have a lot of new features/changes in the current master.
> >>>>>>>>>>> I think instead of aiming for big feature-packed releases we should aim
> >>>>>>>>>>> for making a regular release every few months - we should make regular
> >>>>>>>>>>> releases which people could install and use.
> >>>>>>>>>>> After all, releasing Hive after more than 2 years would be a big step
> >>>>>>>>>>> forward in itself - we have so many improvements that I can't even
> >>>>>>>>>>> count...
> >>>>>>>>>>>
> >>>>>>>>>>> But I may not know every aspect of the project / the state of some
> >>>>>>>>>>> internal features - so I would like to ask you:
> >>>>>>>>>>> What would be the bare minimum requirements before we could release the
> >>>>>>>>>>> current master as Hive X?
> >>>>>>>>>>>
> >>>>>>>>>>> There are many nice-to-have-s like:
> >>>>>>>>>>> * hadoop upgrade
> >>>>>>>>>>> * jdk11
> >>>>>>>>>>> * remove HoS or MR
> >>>>>>>>>>> * ?
> >>>>>>>>>>> but I don't think these are blockers...we can make any of these in
> >>>>>>>>>>> the next release if we start making them...
> >>>>>>>>>>>
> >>>>>>>>>>> cheers,
> >>>>>>>>>>> Zoltan